Abstract

In this technical report, we consider conditional density estimation with a maximum
likelihood approach. Under weak assumptions, we obtain a theoretical bound for a
Kullback-Leibler type loss for a single model maximum likelihood estimate. We use a penalized
model selection technique to select a best model within a collection. We give a general condition
on the penalty choice that leads to an oracle type inequality for the resulting estimate. This
construction is applied to two examples of partition-based conditional density models, models
in which the conditional density depends on the covariate only in a piecewise manner. The
first example relies on classical piecewise polynomial densities while the second uses Gaussian
mixtures with varying mixing proportions but the same mixture components. We show how
this last case is related to an unsupervised segmentation application that has been the source
of our motivation for this study.



1 Introduction 

Assume we observe n pairs $((X_i, Y_i))_{1 \le i \le n}$ of random variables; we are interested in
estimating the law of the second variable $Y_i \in \mathcal{Y}$ conditionally to the first one
$X_i \in \mathcal{X}$. In this paper, we assume that the pairs $(X_i, Y_i)$ are independent while
$Y_i$ depends on $X_i$ through its law. More precisely, we assume that the covariates $X_i$ are
independent but not necessarily identically distributed. Assumptions on the $Y_i$s are stronger:
we assume that, conditionally to the $X_i$s, they are independent and each variable $Y_i$ follows
a law with density $s_0(\cdot|X_i)$ with respect to a common known measure $d\lambda$. Our goal is
to estimate this two-variable conditional density function $s_0(\cdot|\cdot)$ from the observations.

This problem was introduced by Rosenblatt [42] in the late 60's. He considered a
stationary framework in which $s_0(y|x)$ is linked to the supposed existing densities $s_0'(x)$ and
$s_0''(x,y)$ of respectively $X_i$ and $(X_i, Y_i)$ by

\[
s_0(y|x) = \frac{s_0''(x,y)}{s_0'(x)}
\]

and proposed a plugin estimate based on kernel estimation of both $s_0'(x)$ and $s_0''(x,y)$. Few
other references on this subject seem to exist before the mid 90's, with a study of a spline tensor
based maximum likelihood estimator proposed by Stone [44] and a bias correction of Rosenblatt's
estimator due to Hyndman et al. [31].



Kernel based methods have been much studied since. For instance, Fan et al. [22] and de Gooijer
and Zerom [17] consider local polynomial estimators, and Hall et al. [26] study a locally logistic
estimator that is later extended by Hyndman and Yao [30]. In this setting, pointwise convergence
properties are considered, and extensions to dependent data are often obtained. The results
depend however on a critical bandwidth that should be chosen according to the regularity of the
unknown conditional density. Its practical choice is rarely discussed, with the notable exceptions
of Bashtannyk and Hyndman [5], Fan and Yim [21] and Hall et al. [27]. Extensions to censored
cases have also been discussed, for instance by van Keilegom and Veraverbeke [48]. See
Li and Racine [36] for a comprehensive review of this topic.

In the approach of Stone [44], the conditional density is estimated through a parametrized
model. This idea has been reused since by Györfi and Kohler [25] with a histogram based
approach, by Efromovich [19, 20] with a Fourier basis, and by Brunel et al. [13] and Akakpo
and Lacour [2] with piecewise polynomial representations. Those authors are able to control
an integrated estimation error: with an integrated total variation loss for the first one and a
quadratic distance loss for the others. Furthermore, in the quadratic framework, they manage
to construct adaptive estimators, that is estimators that do not require the knowledge of the
regularity to be minimax optimal (up to a logarithmic factor), using respectively a blockwise
attenuation principle and a model selection by penalization approach. Note that Brunel et al. [13]
extend their result to censored cases while Akakpo and Lacour [2] are able to consider weakly
dependent data.

In this paper, we consider a direct estimation of the conditional density function through a
maximum likelihood approach. Although natural, this approach has been considered so far only
by Stone [44], as mentioned before, and by Blanchard et al. [11] in a classification setting with
histogram type estimators. Assume we have a set $S_m$ of candidate conditional densities; our
estimate $\widehat{s}_m$ will be simply the maximum likelihood estimate

\[
\widehat{s}_m = \operatorname*{argmin}_{s_m \in S_m} \left( \sum_{i=1}^n -\ln s_m(Y_i|X_i) \right).
\]

Although this estimator may look like a maximum likelihood estimator of the joint density of
$(X_i, Y_i)$, it does not generally coincide with such an estimator, even when the $X_i$s are assumed
to be i.i.d., as every function of $S_m$ is assumed to be a conditional density and not a density. The
only exceptions are when the $X_i$s are assumed in the model to be i.i.d. uniform or non random
and equal. Our aim is then to analyze the finite sample performance of such an estimator in terms
of a Kullback-Leibler type loss. As often, a trade-off appears between a bias term measuring the
closeness of $s_0$ to the set $S_m$ and a variance term depending on the complexity of the set $S_m$
and on the sample size. A good set $S_m$ is thus one for which this trade-off leads to a small risk
bound. Using a penalized model selection approach, we then propose a way to select the best
model among a collection $\mathcal{S} = (S_m)_{m \in \mathcal{M}}$. For a given family of penalties
pen(m), we define the best model $S_{\widehat{m}}$ as the one whose index is

\[
\widehat{m} = \operatorname*{argmin}_{m \in \mathcal{M}} \left( \sum_{i=1}^n -\ln \widehat{s}_m(Y_i|X_i) + \mathrm{pen}(m) \right).
\]

The main result of this paper is a sufficient condition on the penalty pen(m) such that, for any
density function $s_0$ and any sample size n, the adaptive estimate $\widehat{s}_{\widehat{m}}$ performs
almost as well as the best one in the family $\{\widehat{s}_m\}_{m \in \mathcal{M}}$.

The very frequent use of conditional density estimation in econometrics, see Li and Racine
[36] for instance, could have provided a sufficient motivation for this study. However, it turns
out that this work stems from a completely different subject: unsupervised hyperspectral image
segmentation. Using the synchrotron beam of Soleil, the IPANEMA platform [6], for which
one of the authors works, is able to acquire high quality hyperspectral images, that is high
resolution images for which a spectrum is measured at each pixel location. This rapidly provides
a huge amount of data for which automatic processing is almost necessary. One such processing
step is the segmentation of these images into homogeneous zones, so that the spectral analysis
can be performed on fewer places and the geometrical structures can be exhibited. The most
classical unsupervised classification method relies on the density estimation of a Gaussian
mixture by a maximum likelihood principle. The components of the estimated mixture correspond
to classes. In the spirit of Kolaczyk et al. [34] and Antoniadis et al. [3], we have extended this
method by taking into account the localization of the pixel in the mixture weights, going thus
from density estimation to conditional density estimation. As stressed by Maugis and Michel
[39], a fine understanding of the density estimator is crucial to be able to select the right number
of classes. This theoretical work has been motivated by a similar issue for the conditional density
estimation case.

Section 2 is devoted to the analysis of the maximum likelihood estimation in a single model.
It starts with Section 2.1, in which the setting and some notations are given. The risk of the
maximum likelihood estimate in the classical case of a misspecified parametric model is recalled
in Section 2.2. Section 2.3 provides some tools required for the extension of this analysis to the
more general setting presented in Section 2.4. We then focus in Section 3 on the multiple model
case. The penalty used is described in Section 3.1 while the main theorem is given in Section 3.2.
Section 4 introduces partition-based conditional density estimators: we use models in which the
conditional density depends on the covariate only in a piecewise constant manner. We study in
detail two instances of such models: one in which, conditionally to the covariate, the densities
are piecewise polynomial for the Y variable and the other, which corresponds to our hyperspectral
image segmentation motivation, in which, again conditionally to the covariate, the densities are
Gaussian mixtures with the same mixture components but different mixture weights.

2 Single model maximum likelihood estimate 
2.1 Framework and notation 

Our statistical framework is the following: we observe n independent pairs
$((X_i, Y_i))_{1 \le i \le n} \in (\mathcal{X}, \mathcal{Y})^n$ where the $X_i$'s are independent, but
not necessarily of the same law, and, conditionally to $X_i$, each $Y_i$ is a random variable of
unknown conditional density $s_0(\cdot|X_i)$ with respect to a known reference measure $d\lambda$.
For any model $S_m$, a set comprising some candidate conditional densities, we estimate $s_0$ by
the conditional density $\widehat{s}_m$ that maximizes the likelihood (conditionally to
$(X_i)_{1 \le i \le n}$) or equivalently that minimizes the opposite of the log-likelihood, denoted
-log-likelihood from now on:

\[
\widehat{s}_m = \operatorname*{argmin}_{s_m \in S_m} \left( \sum_{i=1}^n -\ln(s_m(Y_i|X_i)) \right).
\]

To avoid existence issues, we should work with almost minimizers of this quantity and define an
$\eta$ -log-likelihood minimizer as any $\widehat{s}_m$ that satisfies

\[
\sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i)) \le \inf_{s_m \in S_m} \left( \sum_{i=1}^n -\ln(s_m(Y_i|X_i)) \right) + \eta.
\]
We should now specify our goodness criterion. As we are working with a maximum likelihood
approach, the most natural quality measure is the Kullback-Leibler divergence KL. As we
consider laws with densities with respect to the known measure $d\lambda$, we use the following notation

\[
KL_\lambda(s,t) = KL(s\,d\lambda, t\,d\lambda) =
\begin{cases}
-\int_\Omega \ln\left(\frac{t}{s}\right) s\,d\lambda & \text{if } s\,d\lambda \ll t\,d\lambda \\
+\infty & \text{otherwise}
\end{cases}
\]

where $s\,d\lambda \ll t\,d\lambda$ means $\forall \Omega' \subset \Omega$,
$\int_{\Omega'} t\,d\lambda = 0 \implies \int_{\Omega'} s\,d\lambda = 0$. Remark that, contrary to
the quadratic loss, this divergence is an intrinsic quality measure between probability laws: it
does not depend on the reference measure $d\lambda$. However, the densities depend on this reference
measure; this is stressed by the index $\lambda$ when we work with the non intrinsic densities instead
of the probability measures. As we deal with conditional densities and not classical densities,
the previous divergence should be adapted. To take into account the structure of conditional
densities and the design of $(X_i)_{1 \le i \le n}$, we use the following tensorized divergence:

\[
KL^{\otimes_n}_\lambda(s,t) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n KL_\lambda\big(s(\cdot|X_i), t(\cdot|X_i)\big) \right].
\]

This divergence appears as the natural one in this setting and reduces to classical ones in specific
settings:

• If the law of $Y_i$ is independent of $X_i$, that is $s(\cdot|X_i) = s(\cdot)$ and
$t(\cdot|X_i) = t(\cdot)$ do not depend on $X_i$, this divergence reduces to the classical $KL_\lambda(s,t)$.

• If the $X_i$'s are not random but fixed, that is we consider a fixed design case, this divergence
is the classical fixed design type divergence in which there is no expectation.

• If the $X_i$'s are i.i.d., this divergence is nothing but
$KL^{\otimes_n}_\lambda(s,t) = \mathbb{E}\left[ KL_\lambda(s(\cdot|X_1), t(\cdot|X_1)) \right]$.

Note that this divergence is an integrated divergence as it is the average over the index i of the
mean with respect to the law of $X_i$ of the divergence between the conditional densities for a
given covariate value. Remark in particular that more weight is given to regions of high density
of the covariates than to regions of low density and that the values of the divergence
outside the supports of the $X_i$s are not used. If we assume that each $X_i$ has a law
with density with respect to a common finite positive measure $\mu$ and that all those densities are
lower and upper bounded, then all our results hold, up to modification in constants, by replacing
the definition of $KL^{\otimes_n}_\lambda(s,t)$ (and their likes) by the more classical

\[
KL_{\lambda \otimes \mu}(s,t) = \int_{\mathcal{X}} KL_\lambda\big(s(\cdot|x), t(\cdot|x)\big)\, d\mu.
\]

We stress that these types of loss are similar to the ones used in the machine-learning community
(see for instance Catoni [14], which has inspired our notations). Such losses also appear, but
less often, in regression with random design (see for instance Birgé [8]) or in other conditional
density estimation studies (see for instance Brunel et al. [13] and Akakpo and Lacour [2]).
When $\widehat{s}$ is an estimator, or any function that depends on the observation,
$KL^{\otimes_n}_\lambda(s,\widehat{s})$ measures this (random) integrated divergence between $s$ and
$\widehat{s}$ conditionally to the observation while $\mathbb{E}\left[KL^{\otimes_n}_\lambda(s,\widehat{s})\right]$
is the average of this random quantity with respect to the observations.
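As a small numerical illustration of the fixed design case, the following sketch evaluates the tensorized divergence when the response takes finitely many values and $\lambda$ is the counting measure. The helper names (`kl`, `tensorized_kl`), the two conditional laws and the design points are illustrative assumptions, not objects taken from this report.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete laws given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p is not absolutely continuous with respect to q
            total += pi * math.log(pi / qi)
    return total

def tensorized_kl(s, t, design):
    """Fixed design tensorized divergence: average over the design of KL(s(.|x), t(.|x))."""
    return sum(kl(s(x), t(x)) for x in design) / len(design)

# Hypothetical conditional laws on a two-point response space.
def s(x):
    return [0.5, 0.5] if x < 0.5 else [0.9, 0.1]

def t(x):
    return [0.5, 0.5]

# s and t only differ for x >= 0.5, so the divergence grows with the
# proportion of design points falling in that region.
design_low = [0.1, 0.2, 0.3, 0.9]   # one point out of four in [0.5, 1]
design_high = [0.1, 0.7, 0.8, 0.9]  # three points out of four in [0.5, 1]
print(tensorized_kl(s, t, design_low), tensorized_kl(s, t, design_high))
```

With these toy choices the second design gives a divergence three times larger than the first one, which reflects the remark that regions with many covariates weigh more and that values outside the support of the design are ignored.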
2.2 Asymptotic analysis of a parametric model

Assume that $S_m$ is a parametric model of conditional densities,

\[
S_m = \left\{ s_{\theta_m}(y|x) \,\middle|\, \theta_m \in \Theta_m \subset \mathbb{R}^{D_m} \right\},
\]

to which the true conditional density $s_0$ does not necessarily belong. In this case, if we let

\[
\widehat{\theta}_m = \operatorname*{argmin}_{\theta_m \in \Theta_m} \left( \sum_{i=1}^n -\ln(s_{\theta_m}(Y_i|X_i)) \right)
\]

then $\widehat{s}_m = s_{\widehat{\theta}_m}$. White [49] has studied this misspecified model setting for
density estimation but his results can easily be extended to the conditional density case.

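If the model is identifiable and under some (strong) regularity assumptions on
$\theta_m \mapsto s_{\theta_m}$, provided the $D_m \times D_m$ matrices $A(\theta_m)$ and $B(\theta_m)$
defined by

\[
\begin{aligned}
A(\theta_m)_{k,l} &= \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \int \frac{-\partial^2 \ln s_{\theta_m}}{\partial \theta_{m,k}\, \partial \theta_{m,l}}(y|X_i)\, s_0(y|X_i)\, d\lambda \right] \\
B(\theta_m)_{k,l} &= \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \int \frac{\partial \ln s_{\theta_m}}{\partial \theta_{m,k}}(y|X_i)\, \frac{\partial \ln s_{\theta_m}}{\partial \theta_{m,l}}(y|X_i)\, s_0(y|X_i)\, d\lambda \right]
\end{aligned}
\]

exist, the analysis of White [49] implies that, if we let

\[
\theta_m^\star = \operatorname*{argmin}_{\theta_m \in \Theta_m} KL^{\otimes_n}_\lambda(s_0, s_{\theta_m}),
\]

$\mathbb{E}\left[KL^{\otimes_n}_\lambda(s_0, \widehat{s}_m)\right]$ is asymptotically equivalent to

\[
KL^{\otimes_n}_\lambda(s_0, s_{\theta_m^\star}) + \frac{1}{2n} \mathrm{Tr}\left( B(\theta_m^\star) A(\theta_m^\star)^{-1} \right).
\]

When $s_0$ belongs to the model, i.e. $s_0 = s_{\theta_m^\star}$, $B(\theta_m^\star) = A(\theta_m^\star)$ and
thus the previous asymptotic equivalent of $\mathbb{E}\left[KL^{\otimes_n}_\lambda(s_0, \widehat{s}_m)\right]$
is the classical parametric one

\[
\min_{\theta_m} KL^{\otimes_n}_\lambda(s_0, s_{\theta_m}) + \frac{1}{2n} D_m.
\]

This simple expression does not hold when $s_0$ does not belong to the parametric model as
$\mathrm{Tr}\left(B(\theta_m^\star) A(\theta_m^\star)^{-1}\right)$ cannot generally be simplified.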

A short glimpse at the proof of the previous result shows that it depends heavily on the
asymptotic normality of $\sqrt{n}(\widehat{\theta}_m - \theta_m^\star)$. One may wonder if extensions of
this result, often called the Wilks' phenomenon [50], exist when this normality does not hold, for
instance in the non parametric case or when the model is not identifiable. Along these lines, Fan
et al. [23] propose a generalization of the corresponding Chi-Square goodness-of-fit test in several
settings and Boucheron and Massart [12] study the finite sample deviation of the corresponding
empirical quantity in a bounded loss setting.

Our aim is to derive a non asymptotic upper bound of type

\[
\mathbb{E}\left[KL^{\otimes_n}_\lambda(s_0, \widehat{s}_m)\right] \le \left( \min_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{1}{2n} D_m \right) + C_2 \frac{1}{n}
\]

with as few assumptions on the conditional density set $S_m$ as possible. Note that we only aim
at having an upper bound and do not focus on the (important) question of the existence of a
corresponding lower bound.

Our answer is far from definitive: the upper bound we obtain is the following weaker one

\[
\mathbb{E}\left[JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_m)\right] \le (1 + \epsilon) \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\kappa_0}{n} \mathfrak{D}_m \right) + C_2 \frac{1}{n}
\]

in which the left-hand $KL^{\otimes_n}_\lambda(s_0, \widehat{s}_m)$ has been replaced by a smaller divergence
$JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_m)$ described below, $\epsilon$ can be chosen arbitrarily
small, $\mathfrak{D}_m$ is a model complexity term playing the role of the dimension and $\kappa_0$ is a
constant that depends on $\epsilon$. This result has nevertheless the right bias/variance trade-off
flavor and can be used to recover usual minimax properties of specific estimators.

2.3 Jensen-Kullback-Leibler divergence and bracketing entropy 

The main visible loss is the use of a divergence smaller than the Kullback-Leibler one (but larger
than the squared Hellinger distance and the squared $L_1$ loss, whose definitions are recalled later).
Namely, we use the Jensen-Kullback-Leibler divergence $JKL_\rho$ with $\rho \in (0,1)$ defined by

\[
JKL_\rho(s\,d\lambda, t\,d\lambda) = JKL_{\rho,\lambda}(s,t) = \frac{1}{\rho} KL_\lambda\big(s, (1-\rho)s + \rho t\big).
\]

Note that this divergence appears explicitly with $\rho = \frac{1}{2}$ in Massart [38], but can also be
found implicitly in Birgé and Massart [9] and van de Geer [46]. We use the name
Jensen-Kullback-Leibler divergence in the same way Lin [37] uses the name Jensen-Shannon divergence
for a sibling in his information theory work. The main tools in the proof of the previous inequality
are deviation inequalities for sums of random variables and their suprema. Those tools require
a boundedness assumption on the controlled functions that is not satisfied by the -log-likelihood
differences $-\ln \frac{s_m}{s_0}$. When considering the Jensen-Kullback-Leibler divergence, those
ratios are implicitly replaced by ratios $-\frac{1}{\rho} \ln \frac{(1-\rho)s_0 + \rho s_m}{s_0}$ that
are close to the -log-likelihood differences when the $s_m$ are close to $s_0$ and always upper
bounded by $-\frac{\ln(1-\rho)}{\rho}$. This divergence is smaller than the Kullback-Leibler one but
larger, up to a constant factor, than the squared Hellinger one,
$d^2_\lambda(s,t) = \int_\Omega |\sqrt{s} - \sqrt{t}|^2 d\lambda$, and the squared $L_1$ distance,
$\|s - t\|^2_{\lambda,1} = \left( \int_\Omega |s - t|\, d\lambda \right)^2$, as proved in Appendix.

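Proposition 1. For any probability measures $s\,d\lambda$ and $t\,d\lambda$ and any $\rho \in (0,1)$

\[
C_\rho\, d^2_\lambda(s,t) \le JKL_{\rho,\lambda}(s,t) \le KL_\lambda(s,t)
\]

with $C_\rho = \frac{1}{\rho} \min\left( \frac{1-\rho}{\rho}, 1 \right) \left( \ln\left( 1 + \frac{\rho}{1-\rho} \right) - \rho \right)$ while

\[
\max(C_\rho/4, \rho/2)\, \|s - t\|^2_{\lambda,1} \le JKL_{\rho,\lambda}(s,t) \le KL_\lambda(s,t).
\]

Furthermore, if $s\,d\lambda \ll t\,d\lambda$ then

\[
d^2_\lambda(s,t) \le KL_\lambda(s,t) \le \left( 2 + \ln \left\| \frac{s}{t} \right\|_\infty \right) d^2_\lambda(s,t)
\]

while

\[
\frac{1}{2} \|s - t\|^2_{\lambda,1} \le KL_\lambda(s,t) \le \left\| \frac{1}{t} \right\|_\infty \|s - t\|^2_{\lambda,2}.
\]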
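The ordering of these divergences can be checked numerically on a toy example. The sketch below works with discrete laws (counting measure) and writes the constant of Proposition 1 as $C_\rho = \frac{1}{\rho}\min\left(\frac{1-\rho}{\rho}, 1\right)\left(\ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho\right)$; the function names and the two probability vectors are illustrative assumptions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete laws with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def hellinger2(p, q):
    """Squared Hellinger distance (without the 1/2 normalization, as in the text)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def l1_squared(p, q):
    """Squared L1 distance."""
    return sum(abs(a - b) for a, b in zip(p, q)) ** 2

def jkl(p, q, rho):
    """Jensen-Kullback-Leibler divergence: KL(p, (1 - rho) p + rho q) / rho."""
    mix = [(1.0 - rho) * a + rho * b for a, b in zip(p, q)]
    return kl(p, mix) / rho

def c_rho(rho):
    """Constant of Proposition 1, under the reading stated in the lead-in."""
    return (1.0 / rho) * min((1.0 - rho) / rho, 1.0) * (math.log(1.0 + rho / (1.0 - rho)) - rho)

# Hypothetical pair of laws on three points.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

for rho in (0.1, 0.5, 0.9):
    lower_hellinger = c_rho(rho) * hellinger2(p, q)
    lower_l1 = max(c_rho(rho) / 4.0, rho / 2.0) * l1_squared(p, q)
    print(rho, lower_hellinger, lower_l1, jkl(p, q, rho), kl(p, q))
```

For each value of $\rho$, the two lower bounds printed first stay below the Jensen-Kullback-Leibler divergence, which itself stays below the Kullback-Leibler divergence. This is only a sanity check on one pair of laws, not a proof.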
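More precisely, as we are in a conditional density setting, we use their tensorized versions

\[
d^{2\otimes_n}_\lambda(s,t) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n d^2_\lambda\big(s(\cdot|X_i), t(\cdot|X_i)\big) \right]
\quad \text{and} \quad
JKL^{\otimes_n}_{\rho,\lambda}(s,t) = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n JKL_{\rho,\lambda}\big(s(\cdot|X_i), t(\cdot|X_i)\big) \right].
\]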



We focus now on the definition of the model complexity $\mathfrak{D}_m$. It involves a bracketing
entropy condition on the model $S_m$ with respect to the Hellinger type divergence
$d^{\otimes_n}_\lambda(s,t) = \sqrt{d^{2\otimes_n}_\lambda(s,t)}$.
A bracket $[t^-, t^+]$ is a pair of functions such that
$\forall (x,y) \in \mathcal{X} \times \mathcal{Y}$, $t^-(y|x) \le t^+(y|x)$. A conditional
density function $s$ is said to belong to the bracket $[t^-, t^+]$ if
$\forall (x,y) \in \mathcal{X} \times \mathcal{Y}$, $t^-(y|x) \le s(y|x) \le t^+(y|x)$.
The bracketing entropy $H_{[\cdot], d^{\otimes_n}_\lambda}(\delta, S)$ of a set $S$ is defined as the
logarithm of the minimum number of brackets $[t^-, t^+]$ of width $d^{\otimes_n}_\lambda(t^-, t^+)$
smaller than $\delta$ such that every function of $S$ belongs to one of these brackets.
$\mathfrak{D}_m$ depends on the bracketing entropies not of the global models $S_m$ but of the ones
of smaller localized sets
$S_m(\widetilde{s}, \sigma) = \left\{ s_m \in S_m \,\middle|\, d^{\otimes_n}_\lambda(\widetilde{s}, s_m) \le \sigma \right\}$.
Indeed, we impose a structural assumption:

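Assumption ($H_m$). There is a non-decreasing function $\phi_m(\delta)$ such that
$\delta \mapsto \frac{1}{\delta} \phi_m(\delta)$ is non-increasing on $(0, +\infty)$ and for every
$\sigma \in \mathbb{R}^+$ and every $s_m \in S_m$

\[
\int_0^\sigma \sqrt{H_{[\cdot], d^{\otimes_n}_\lambda}\big(\delta, S_m(s_m, \sigma)\big)}\, d\delta \le \phi_m(\sigma).
\]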



Note that the function $\sigma \mapsto \int_0^\sigma \sqrt{H_{[\cdot], d^{\otimes_n}_\lambda}(\delta, S_m)}\, d\delta$
does always satisfy this assumption. $\mathfrak{D}_m$ is then defined as $n\sigma_m^2$ with $\sigma_m^2$
the unique root of $\frac{1}{\sigma} \phi_m(\sigma) = \sqrt{n}\sigma$. A good choice of $\phi_m$ is one
which leads to a small upper bound of $\mathfrak{D}_m$. This bracketing entropy integral, often called
the Dudley integral, plays an important role in empirical processes theory, as stressed for instance
in van der Vaart and Wellner [47] and in Kosorok [35]. The equation defining $\sigma_m$ corresponds
to a crude optimization of a supremum bound as shown explicitly in the proof. This definition is
obviously far from being very explicit but it turns out that it can be related to an entropic
dimension of the model. Recall that the classical entropy dimension of a compact set $S$ with
respect to a metric $d$ can be defined as the smallest non negative real $D$ such that there is a non
negative $V$ such that

\[
\forall \delta > 0, \quad H_d(\delta, S) \le V + D \log\left( \frac{1}{\delta} \right)
\]

where $H_d$ is the classical entropy with respect to the metric $d$. The parameter $V$ can be interpreted
as the logarithm of the volume of the set. Replacing the classical entropy by a bracketing one,
we define the bracketing dimension $D_m$ of a compact set as the smallest real $D$ such that there
is a $V$ such that

\[
\forall \delta > 0, \quad H_{[\cdot],d}(\delta, S) \le V + D \log\left( \frac{1}{\delta} \right).
\]

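As hinted by the notation, for parametric models, under mild assumptions on the parametrization,
this bracketing dimension coincides with the usual one. Under such assumptions, one can prove
that $\mathfrak{D}_m$ is proportional to $D_m$. More precisely, working with the localized set
$S_m(s, \sigma)$ instead of $S_m$, we obtain in Appendix

Proposition 2.

• if $\exists D_m \ge 0$, $\exists V_m \ge 0$, $\forall \delta \in (0, \sqrt{2}]$,
$H_{[\cdot], d^{\otimes_n}_\lambda}(\delta, S_m) \le V_m + D_m \ln \frac{1}{\delta}$ then

  - if $D_m > 0$, ($H_m$) holds with
    $\mathfrak{D}_m \le \left( 2 C_{\star,m} + 1 + \left( \ln \frac{n}{e\, C_{\star,m} D_m} \right)_+ \right) D_m$
    with $C_{\star,m} = \left( \sqrt{\frac{V_m}{D_m}} + \sqrt{\pi} \right)^2$,

  - if $D_m = 0$, ($H_m$) holds with $\phi_m(\sigma) = \sigma \sqrt{V_m}$ such that $\mathfrak{D}_m = V_m$,

• if $\exists D_m \ge 0$, $\exists V_m \ge 0$, $\forall \sigma \in (0, \sqrt{2}]$, $\forall \delta \in (0, \sigma]$,
$H_{[\cdot], d^{\otimes_n}_\lambda}\big(\delta, S_m(s_m, \sigma)\big) \le V_m + D_m \ln \frac{\sigma}{\delta}$ then

  - if $D_m > 0$, ($H_m$) holds with $\phi_m$ such that $\mathfrak{D}_m = C_{\star,m} D_m$ with
    $C_{\star,m} = \left( \sqrt{\frac{V_m}{D_m}} + \sqrt{\pi} \right)^2$,

  - if $D_m = 0$, ($H_m$) holds with $\phi_m(\sigma) = \sigma \sqrt{V_m}$ such that $\mathfrak{D}_m = V_m$.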



Note that we assume bounds on the entropy only for $\delta$ and $\sigma$ smaller than $\sqrt{2}$, but, as
for any conditional densities pair $(s,t)$, $d^{\otimes_n}_\lambda(s,t) \le \sqrt{2}$,

\[
H_{[\cdot], d^{\otimes_n}_\lambda}\big(\delta, S_m(s_m, \sigma)\big) = H_{[\cdot], d^{\otimes_n}_\lambda}\big(\delta \wedge \sqrt{2}, S_m(s_m, \sigma \wedge \sqrt{2})\big)
\]

which implies that those bounds are still useful when $\delta$ and $\sigma$ are large. Assume now that all
models are such that $\frac{V_m}{D_m} \le C$, i.e. their log-volumes $V_m$ grow at most linearly with the
dimension (as is the case for instance for hypercubes with the same width). One deduces that
Assumptions ($H_m$) hold simultaneously for every model with a common constant
$C_\star = \left( \sqrt{C} + \sqrt{\pi} \right)^2$. The model complexity $\mathfrak{D}_m$ can thus be chosen
roughly proportional to the dimension in this case;
this justifies the notation as well as our claim at the end of the previous section.
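The implicit definition of the complexity as $n\sigma_m^2$, with $\sigma_m$ solving $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\sigma$, is easy to evaluate numerically. The sketch below is one possible way to do it, assuming only that $\sigma \mapsto \phi_m(\sigma)/\sigma$ is non-increasing so that the root is unique; the function name `complexity_from_phi`, the search interval and the numerical values of $V_m$ and $n$ are hypothetical.

```python
import math

def complexity_from_phi(phi, n, lo=1e-12, hi=1e6, iters=200):
    """Return n * sigma**2 where sigma solves phi(sigma) / sigma = sqrt(n) * sigma.

    phi(sigma) / sigma is assumed non-increasing, so the difference below is
    strictly decreasing in sigma and a bisection on [lo, hi] finds the root.
    """
    def gap(sigma):
        return phi(sigma) / sigma - math.sqrt(n) * sigma

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    sigma = math.sqrt(lo * hi)
    return n * sigma ** 2

# Case D_m = 0 of Proposition 2: phi_m(sigma) = sigma * sqrt(V_m).
V_m = 3.0  # hypothetical log-volume
n = 100    # hypothetical sample size
print(complexity_from_phi(lambda sigma: sigma * math.sqrt(V_m), n))
```

For this choice of $\phi_m$ the root is $\sigma_m = \sqrt{V_m/n}$, so the computed complexity equals $V_m$ whatever the sample size, in agreement with the case $D_m = 0$ of Proposition 2.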

2.4 Single model maximum likelihood estimation

For technical reasons, we also need a separability assumption on our model:

Assumption ($\mathrm{Sep}_m$). There exist a countable subset $S'_m$ of $S_m$ and a set $\mathcal{Y}'_m$
with $\lambda(\mathcal{Y} \setminus \mathcal{Y}'_m) = 0$ such that for every $t \in S_m$, there exists a
sequence $(t_k)_{k \ge 1}$ of elements of $S'_m$ such that for every $x$ and for every
$y \in \mathcal{Y}'_m$, $\ln(t_k(y|x))$ goes to $\ln(t(y|x))$ as $k$ goes to infinity.

We are now ready to state our risk bound theorem: 

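Theorem 1. Assume we observe $(X_i, Y_i)$ with unknown conditional density $s_0$. Assume $S_m$ is
a set of conditional densities for which Assumptions ($H_m$) and ($\mathrm{Sep}_m$) hold and let
$\widehat{s}_m$ be an $\eta$ -log-likelihood minimizer in $S_m$

\[
\sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i)) \le \inf_{s_m \in S_m} \left( \sum_{i=1}^n -\ln(s_m(Y_i|X_i)) \right) + \eta.
\]

Then for any $\rho \in (0,1)$ and any $C_1 > 1$, there are two constants $\kappa_0$ and $C_2$ depending
only on $\rho$ and $C_1$ such that, for $\mathfrak{D}_m = n\sigma_m^2$ with $\sigma_m$ the unique root of
$\frac{1}{\sigma} \phi_m(\sigma) = \sqrt{n}\sigma$, the likelihood estimate $\widehat{s}_m$ satisfies

\[
\mathbb{E}\left[JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_m)\right] \le C_1 \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\kappa_0}{n} \mathfrak{D}_m \right) + C_2 \frac{1}{n} + \frac{\eta}{n}.
\]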
This theorem holds without any assumption on the design $X_i$; in particular we do not assume
that the covariates admit upper or lower bounded densities. The law of the design appears
however in the divergences $JKL^{\otimes_n}_{\rho,\lambda}$ and $KL^{\otimes_n}_\lambda$ used to assess the
quality of the estimate as well as in the definition of the divergence $d^{\otimes_n}_\lambda$ used to
measure the bracketing entropy. By construction, those quantities do not involve the values of the
conditional densities outside the support of the $X_i$s and put more focus on the regions of high
covariate density than on the others. Note that Assumption ($H_m$) could be further localized: it
suffices to impose that the condition on the Dudley integral holds for a sequence of minimizers of
$d^{2\otimes_n}_\lambda(s_0, s_m)$.

We thus obtain a bound on the expected loss similar to the one obtained in the parametric case
that holds for finite samples and that does not require the strong regularity assumptions of White
[49]. In particular, we do not even require an identifiability condition in the parametric case.
As often in empirical processes theory, the constant $\kappa_0$ appearing in the bound is pessimistic.
Even in a very simple parametric model, the current best estimates are such that
$\kappa_0 \mathfrak{D}_m$ is still much larger than the variance of Section 2.2. Numerical experiments
show there is a hope that this is only a technical issue. The obtained bound nevertheless quantifies
the expected classical bias-variance trade-off: a good model should be large enough so that the true
conditional density is close to it but, at the same time, it should also be small so that the
$\mathfrak{D}_m$ term does not dominate.

It should be stressed that a result of similar flavor could have been obtained by the information
theory technique of Barron et al. [4] and Kolaczyk et al. [34]. Indeed, if we replace the set
$S_m$ by a discretized version $\mathfrak{S}_m$ so that

\[
\inf_{s_m \in \mathfrak{S}_m} KL^{\otimes_n}_\lambda(s_0, s_m) \le \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{1}{n},
\]

then, if we let $\widehat{s}_m$ be a -log-likelihood minimizer in $\mathfrak{S}_m$,

\[
\mathbb{E}\left[D^{2\otimes_n}_\lambda(s_0, \widehat{s}_m)\right] \le \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{1}{n} \ln |\mathfrak{S}_m| + \frac{1}{n}
\]

where $D^{2\otimes_n}_\lambda$ is the tensorized Bhattacharyya-Renyi divergence, another divergence smaller
than $KL^{\otimes_n}_\lambda$, $|\mathfrak{S}_m|$ is the cardinality of $\mathfrak{S}_m$ and the expectation is
taken conditionally to the covariates $(X_i)_{1 \le i \le n}$. As verified by Barron et al. [4] and
Kolaczyk et al. [34], $\mathfrak{S}_m$ can be chosen with a log-cardinality of order $D_m \ln n$ when the
model is parametric. We thus also obtain a bound of type

\[
\mathbb{E}\left[D^{2\otimes_n}_\lambda(s_0, \widehat{s}_m)\right] \le \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{C}{n} D_m \ln n + \frac{1}{n}
\]

with better constants but with a different divergence. The bound holds however only conditionally
to the design, which can be an issue as soon as this design is random, and requires computing
an adapted discretization of the models.



3 Model selection and penalized maximum likelihood 

3.1 Framework 

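A natural question is then the choice of the model. In the model selection framework, instead
of a single model $S_m$, we assume we have at hand a collection of models
$\mathcal{S} = \{S_m\}_{m \in \mathcal{M}}$. If we assume that Assumptions ($H_m$) and ($\mathrm{Sep}_m$)
hold for all models, then for every model $S_m$

\[
\mathbb{E}\left[JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_m)\right] \le C_1 \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\kappa_0}{n} \mathfrak{D}_m \right) + C_2 \frac{1}{n} + \frac{\eta}{n}.
\]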
Obviously, one of the models minimizes the right hand side. Unfortunately, there is no way to
know which one without knowing $s_0$, i.e. without an oracle. Hence, this oracle model cannot be
used to estimate $s_0$. We nevertheless propose a data-driven strategy to select an estimate among
the collection of estimates $\{\widehat{s}_m\}_{m \in \mathcal{M}}$ according to a selection rule that
performs almost as well as if we had known this oracle.

As always, using simply the -log-likelihood of the estimate in each model

\[
\sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i))
\]

as a criterion is not sufficient. It is an underestimation of the true risk of the estimate and
this leads to choosing models that are too complex. By adding an adapted penalty pen(m), one
hopes to compensate for both the variance term and the bias between
$\frac{1}{n} \sum_{i=1}^n -\ln \frac{\widehat{s}_m(Y_i|X_i)}{s_0(Y_i|X_i)}$ and
$\inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m)$. For a given choice of pen(m), the best model
$S_{\widehat{m}}$ is chosen as the one whose index is an almost minimizer of the penalized
-log-likelihood:

\[
\sum_{i=1}^n -\ln(\widehat{s}_{\widehat{m}}(Y_i|X_i)) + \mathrm{pen}(\widehat{m}) \le \inf_{m \in \mathcal{M}} \left( \sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i)) + \mathrm{pen}(m) \right) + \eta'.
\]

The analysis of the previous section turns out to be crucial as the intrinsic complexity
$\mathfrak{D}_m$ appears in the assumption on the penalty. It is no surprise that the complexity of
the model collection itself also appears. We need an information theory type assumption on our
collection; we thus assume the existence of a Kraft type inequality for the collection:

Assumption (K). There is a family $(x_m)_{m \in \mathcal{M}}$ of non-negative numbers such that

\[
\sum_{m \in \mathcal{M}} e^{-x_m} \le \Sigma < +\infty.
\]

It can be interpreted as a coding condition, as stressed by Barron et al. [4] where a similar
assumption is used. Remark that if this assumption holds, it also holds for any permutation
of the coding terms $x_m$. We should try to mitigate this arbitrariness by favoring choices of $x_m$
for which the ratio with the intrinsic entropy term $\mathfrak{D}_m$ is as small as possible. Indeed,
as the condition on the penalty is of the form

\[
\mathrm{pen}(m) \ge \kappa \left( \mathfrak{D}_m + x_m \right),
\]

this ensures that this lower bound is dominated by the intrinsic quantity $\mathfrak{D}_m$.
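The selection rule and the Kraft type condition can be illustrated by a short sketch. It assumes that the -log-likelihood of the estimate in each model, the complexities $\mathfrak{D}_m$ and the weights $x_m$ are already available as dictionaries indexed by the model; the function names, the constant `kappa` and all numerical values below are made up for the illustration.

```python
import math

def kraft_sum(x):
    """Sum of exp(-x_m) over the collection, to be compared with a finite bound."""
    return sum(math.exp(-x_m) for x_m in x.values())

def select_model(neg_log_lik, complexity, x, kappa):
    """Index minimizing the penalized -log-likelihood with pen(m) = kappa * (D_m + x_m)."""
    def criterion(m):
        return neg_log_lik[m] + kappa * (complexity[m] + x[m])
    return min(neg_log_lik, key=criterion)

# Toy nested collection indexed by its dimension m = 1, ..., 5.
neg_log_lik = {1: 520.0, 2: 480.0, 3: 470.0, 4: 468.0, 5: 467.5}
complexity = {m: float(m) for m in neg_log_lik}  # complexity taken equal to the dimension
x = {m: float(m) for m in neg_log_lik}           # weights x_m of the same order as the complexity

print(kraft_sum(x))                                        # finite, here smaller than 1
print(select_model(neg_log_lik, complexity, x, kappa=2.0)) # penalized choice
print(select_model(neg_log_lik, complexity, x, kappa=0.0)) # unpenalized choice
```

Without penalty the largest model is always selected since its -log-likelihood is the smallest, while the penalized criterion stops at an intermediate dimension. Choosing $x_m$ of the same order as the complexity keeps the Kraft sum finite without changing the order of magnitude of the penalty.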



3.2 A general theorem for penalized maximum likelihood conditional
density estimation

Our main theorem is then: 

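Theorem 2. Assume we observe $(X_i, Y_i)$ with unknown conditional density $s_0$. Let
$\mathcal{S} = (S_m)_{m \in \mathcal{M}}$ be an at most countable collection of conditional density sets.
Assume Assumption (K) holds while Assumptions ($H_m$) and ($\mathrm{Sep}_m$) hold for every model
$S_m \in \mathcal{S}$. Let $\widehat{s}_m$ be an $\eta$ -log-likelihood minimizer in $S_m$

\[
\sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i)) \le \inf_{s_m \in S_m} \left( \sum_{i=1}^n -\ln(s_m(Y_i|X_i)) \right) + \eta.
\]

Then for any $\rho \in (0,1)$ and any $C_1 > 1$, there are two constants $\kappa_0$ and $C_2$ depending only
on $\rho$ and $C_1$ such that, as soon as for every index $m \in \mathcal{M}$

\[
\mathrm{pen}(m) \ge \kappa \left( \mathfrak{D}_m + x_m \right) \quad \text{with } \kappa > \kappa_0
\]

where $\mathfrak{D}_m = n\sigma_m^2$ with $\sigma_m$ the unique root of
$\frac{1}{\sigma} \phi_m(\sigma) = \sqrt{n}\sigma$, the penalized likelihood estimate
$\widehat{s}_{\widehat{m}}$ with $\widehat{m}$ such that

\[
\sum_{i=1}^n -\ln(\widehat{s}_{\widehat{m}}(Y_i|X_i)) + \mathrm{pen}(\widehat{m}) \le \inf_{m \in \mathcal{M}} \left( \sum_{i=1}^n -\ln(\widehat{s}_m(Y_i|X_i)) + \mathrm{pen}(m) \right) + \eta'
\]

satisfies

\[
\mathbb{E}\left[JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_{\widehat{m}})\right] \le C_1 \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\mathrm{pen}(m)}{n} \right) + C_2 \frac{\Sigma}{n} + \frac{\eta + \eta'}{n}.
\]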



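Note that, as in Section 2.4, the approach of Barron et al. [4] and Kolaczyk et al. [34] could have
been used to obtain a similar result with the help of discretization.

This theorem extends Theorem 7.11 of Massart [38], which handles only density estimation. As
in that theorem, the cost of model selection with respect to the choice of the best single model is
proved to be very mild. Indeed, let $\mathrm{pen}(m) = \kappa(\mathfrak{D}_m + x_m)$; then one obtains

\[
\begin{aligned}
\mathbb{E}\left[JKL^{\otimes_n}_{\rho,\lambda}(s_0, \widehat{s}_{\widehat{m}})\right]
&\le C_1 \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\kappa}{n} (\mathfrak{D}_m + x_m) \right) + C_2 \frac{\Sigma}{n} + \frac{\eta + \eta'}{n} \\
&\le C_1 \frac{\kappa}{\kappa_0} \left( \max_{m \in \mathcal{M}} \frac{\mathfrak{D}_m + x_m}{\mathfrak{D}_m} \right) \inf_{m \in \mathcal{M}} \left( \inf_{s_m \in S_m} KL^{\otimes_n}_\lambda(s_0, s_m) + \frac{\kappa_0}{n} \mathfrak{D}_m \right) + C_2 \frac{\Sigma}{n} + \frac{\eta + \eta'}{n}.
\end{aligned}
\]

As soon as the term $x_m$ is always small relative to $\mathfrak{D}_m$, we thus obtain an oracle
inequality showing that the penalized estimate satisfies, up to a small factor, the bound of
Theorem 1 for the estimate in the best model. The price to pay for the use of a collection of models
is thus small. The gain is on the contrary very important: we do not have to know the best model
within a collection to almost achieve its performance.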

So far we have not discussed the choice of the model collection; it is however critical to
obtain a good estimator. There is unfortunately no universal choice and it should be adapted to
the specific setting considered. Typically, if we consider conditional densities of regularity
indexed by a parameter $\alpha$, a good collection is one such that for every parameter $\alpha$ there
is a model which achieves a quasi optimal bias/variance trade-off. Efromovich [19, 20] considers
Sobolev type regularity and thus uses models generated by the first elements of the Fourier basis.
Brunel et al. [13] and Akakpo and Lacour [2] consider anisotropic regularity spaces for which they
show that a collection of piecewise polynomial models is adapted. Although those choices are
justified, in these papers, in a quadratic loss approach, they remain good choices in our maximum
likelihood approach with a Kullback-Leibler type loss. Estimators associated to those collections
are thus adaptive to the regularity: without knowing the regularity of the true conditional density,
they select a model in which the estimate performs almost as well as in the oracle model, the best
choice if the regularity was known. In both cases, one could prove that those estimators achieve
the minimax rate for the considered classes, up to a logarithmic factor.

As in Section 2.4, the known estimates of the constant $\kappa_0$ and even of $\mathfrak{D}_m$ can be
pessimistic. This leads to a theoretical penalty which can be too large in practice. A natural
question is thus whether the constant appearing in the penalty can be estimated from the data
without losing a theoretical guarantee on the performance. No definitive answer exists so far, but
numerical experiments in specific cases show that the slope heuristic proposed by Birgé and
Massart [10] may yield a solution.

The assumptions of the previous theorem are as general as possible. It is thus natural to question
the existence of interesting model collections that satisfy its assumptions. We have mentioned
so far the Fourier based collection proposed by Efromovich [19, 20] and the piecewise polynomial
collection of Brunel et al. [13] and Akakpo and Lacour [2], who consider anisotropic regularity. We
focus on a variation of this last strategy. Motivated by an application to unsupervised image
segmentation, we consider model collections in which, in each model, the conditional densities
depend on the covariate only in a piecewise constant manner. After a general introduction to
these partition-based strategies, we study two cases: a classical one in which the conditional
density depends in a piecewise polynomial manner on the variables and a newer one, which
corresponds to the unsupervised segmentation application, in which the conditional densities are
Gaussian mixtures with common Gaussian components but mixing proportions depending on the
covariate.



4 Partition-based conditional density models 

4.1 Covariate partitioning and conditional density estimation 

Following an idea developed by Kolaczyk et al. [34], we partition the covariate domain and
consider candidate conditional density estimates that depend on the covariate only through the
region it belongs to. We are thus interested in conditional densities that can be written as

$$s(y|x) = \sum_{\mathcal{R}_l \in \mathcal{P}} s(y|\mathcal{R}_l)\,\mathbf{1}_{\{x \in \mathcal{R}_l\}}$$

where $\mathcal{P}$ is a partition of $\mathcal{X}$, $\mathcal{R}_l$ denotes a generic region in this
partition, $\mathbf{1}$ denotes the characteristic function of a set and $s(y|\mathcal{R}_l)$ is a
density for any $\mathcal{R}_l \in \mathcal{P}$. Note that this strategy, called partition-based as
in Willett and Nowak [51], shares a lot with the CART-type strategy proposed
by Donoho [18] in an image processing setting.
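
As a minimal illustration of this definition, the sketch below evaluates a partition-based conditional density for a one-dimensional covariate in $[0,1]$. The helper name `conditional_density`, the breakpoint and the two region densities are hypothetical choices made for the example; they are not part of the construction above.

```python
from bisect import bisect_right

def conditional_density(x, y, breakpoints, region_densities):
    """Evaluate s(y|x) = sum_l s(y|R_l) 1{x in R_l} on [0, 1].

    breakpoints: sorted interior cut points defining the regions R_l.
    region_densities: one density function y -> s(y|R_l) per region.
    """
    l = bisect_right(breakpoints, x)  # index of the region containing x
    return region_densities[l](y)

# Hypothetical example: two regions, [0, 0.5) and [0.5, 1].
uniform = lambda y: 1.0 if 0.0 <= y <= 1.0 else 0.0
triangular = lambda y: 2.0 * y if 0.0 <= y <= 1.0 else 0.0
densities = [uniform, triangular]

print(conditional_density(0.2, 0.3, [0.5], densities))
print(conditional_density(0.8, 0.3, [0.5], densities))
```

The covariate only selects which density is used; within a region the value of $x$ plays no further role.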

Denoting $\|\mathcal{P}\|$ the number of regions in this partition, the models we consider are thus
specified by a partition $\mathcal{P}$ and a set $\mathcal{F}$ of $\|\mathcal{P}\|$-tuples of densities
into which $(s(\cdot|\mathcal{R}_l))_{\mathcal{R}_l \in \mathcal{P}}$ is chosen. This set $\mathcal{F}$
can be a product of density sets, yielding an independent choice on each region of the partition,
or have a more complex structure. We study two examples: in the first one, $\mathcal{F}$ is indeed a
product of piecewise polynomial density sets, while in the second one $\mathcal{F}$ is a set of
$\|\mathcal{P}\|$-tuples of Gaussian mixtures sharing the same mixture components. Nevertheless,
denoting with a slight abuse of notation $S_{\mathcal{P},\mathcal{F}}$ such a model, our
$\eta$-log-likelihood estimate in this model is any conditional density
$\hat{s}_{\mathcal{P},\mathcal{F}}$ such that

$$\sum_{i=1}^n -\ln\left(\hat{s}_{\mathcal{P},\mathcal{F}}(Y_i|X_i)\right) \le \min_{s_{\mathcal{P},\mathcal{F}} \in S_{\mathcal{P},\mathcal{F}}} \sum_{i=1}^n -\ln\left(s_{\mathcal{P},\mathcal{F}}(Y_i|X_i)\right) + \eta.$$

Figure 1: Example of a recursive dyadic partition with its associated dyadic tree.

We first specify the partition collection we consider. For the sake of simplicity we restrict our
description to the case where the covariate space $\mathcal{X}$ is simply $[0,1]^{d_X}$. We stress that
the proposed strategy can easily be adapted to more general settings including discrete variables,
ordered or not. We impose a strong structural assumption on the partition collections considered
that allows us to control their complexity. We only consider five specific hyperrectangle based
collections of partitions of $[0,1]^{d_X}$:



• Two are recursive dyadic partition collections.

— The uniform dyadic partition collection (UDP($\mathcal{X}$)) in which all hypercubes are
subdivided in $2^{d_X}$ hypercubes of equal size at each step. In this collection, in the partition
obtained after $J$ steps, all the $2^{d_X J}$ hyperrectangles
$\{\mathcal{R}_l\}_{1 \le l \le \|\mathcal{P}\|}$ are thus hypercubes whose measure $|\mathcal{R}_l|$
satisfies $|\mathcal{R}_l| = 2^{-d_X J}$. We stop the recursion as soon as the number of steps $J$
satisfies $\frac{2^{d_X}}{n} \ge |\mathcal{R}_l| \ge \frac{1}{n}$.

— The recursive dyadic partition collection (RDP($\mathcal{X}$)) in which at each step a hypercube
of measure $|\mathcal{R}_l| \ge \frac{2^{d_X}}{n}$ is subdivided in $2^{d_X}$ hypercubes of equal size.

• Two are recursive split partition collections.

— The recursive dyadic split partition (RDSP($\mathcal{X}$)) in which at each step a hyperrectangle
of measure $|\mathcal{R}_l| \ge \frac{2}{n}$ can be subdivided in 2 hyperrectangles of equal size by
an even split along one of the $d_X$ possible directions.

— The recursive split partition (RSP($\mathcal{X}$)) in which at each step a hyperrectangle of
measure $|\mathcal{R}_l| \ge \frac{2}{n}$ can be subdivided in 2 hyperrectangles of measure larger
than $\frac{1}{n}$ by a split along a point of the grid in one of the $d_X$ possible directions.

• The last one does not possess a hierarchical structure. The hyperrectangle partition collection
(HRP($\mathcal{X}$)) is the full collection of all partitions into hyperrectangles whose corners
are located on the grid $\frac{1}{n}\mathbb{Z}^{d_X}$ and whose volume is larger than $\frac{1}{n}$.

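We denote by $\mathcal{S}_{\mathcal{P}}^{\star(\mathcal{X})}$ the corresponding partition collection
where $\star(\mathcal{X})$ is either UDP($\mathcal{X}$), RDP($\mathcal{X}$), RDSP($\mathcal{X}$),
RSP($\mathcal{X}$) or HRP($\mathcal{X}$).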
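
To make the recursive construction concrete, here is a small sketch that enumerates a recursive dyadic partition collection in dimension $d_X = 1$. It assumes that an interval may be split into two halves as long as its length is at least $2/n$; the function name `rdp_partitions` is hypothetical, and exact fractions are used only to avoid rounding issues.

```python
from fractions import Fraction

def rdp_partitions(lo, hi, n):
    """List all recursive dyadic partitions of [lo, hi) for d_X = 1.

    Assumption of this sketch: an interval may be split in two halves
    as long as its length is at least 2/n.
    """
    partitions = [[(lo, hi)]]  # the unsplit interval is always allowed
    if hi - lo >= Fraction(2, n):
        mid = (lo + hi) / 2
        for left in rdp_partitions(lo, mid, n):
            for right in rdp_partitions(mid, hi, n):
                partitions.append(left + right)
    return partitions

parts = rdp_partitions(Fraction(0), Fraction(1), 4)
for p in parts:
    print([(float(a), float(b)) for a, b in p])
```

Each partition is obtained by pruning the full dyadic tree, which is the tree structure exploited by the fast selection algorithms mentioned in this section.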

As noticed by Kolaczyk and Nowak [33], Huang et al. [29] or Willett and Nowak [51], the
first four partition collections, $(\mathcal{S}_{\mathcal{P}}^{\mathrm{UDP}(\mathcal{X})},
\mathcal{S}_{\mathcal{P}}^{\mathrm{RDP}(\mathcal{X})}, \mathcal{S}_{\mathcal{P}}^{\mathrm{RDSP}(\mathcal{X})},
\mathcal{S}_{\mathcal{P}}^{\mathrm{RSP}(\mathcal{X})})$, have a tree structure.

Figure 1 illustrates this structure for a RDP($\mathcal{X}$) partition. This specific structure is
mainly used to obtain an efficient numerical algorithm performing the model selection. For the sake
of completeness, we have also added the collection
$\mathcal{S}_{\mathcal{P}}^{\mathrm{HRP}(\mathcal{X})}$, which is much more complex to deal with and for
which only exhaustive search algorithms exist.

As proved in Appendix, those partition collections satisfy Kraft type inequalities with weights
constant for the UDP($\mathcal{X}$) partition collection and proportional to the number of
hyperrectangles for the other collections. Indeed,

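Proposition 3. For any of the five described partition collections
$\mathcal{S}_{\mathcal{P}}^{\star(\mathcal{X})}$, $\exists A_0^\star, B_0^\star, c_0^\star$ and
$\Sigma_0^\star$ such that for all $c \ge c_0^\star$:

$$\sum_{\mathcal{P} \in \mathcal{S}_{\mathcal{P}}^{\star(\mathcal{X})}} e^{-c\left(A_0^\star + B_0^\star \|\mathcal{P}\|\right)} \le \Sigma_0^\star\, e^{-c \max\left(A_0^\star, B_0^\star\right)}.$$

Those constants can be chosen as follows: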





★ = UDP(A') 


* = RDP(A') 


★ = RDSPCA") 


* = RSFCA") 


★ = HRP(A') 







~^ dxln2 



In 2 

2 




[ln(l + dx)lin2 
2 

2(1 + dx) 




rin(l +fix)lln2 

+ [In n] In 2 
2 

4(1 + dx)n 




dx [In n]in2 
1 

(2n)''^ 



where $\lceil x \rceil_{\ln 2}$ is the smallest multiple of $\ln 2$ larger than $x$. Furthermore, as
soon as $c \ge 2 \ln 2$ the right-hand term of the bound is smaller than 1. This will prove useful
to verify Assumption (K) for the model collections of the next sections.
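
The mechanism behind such Kraft type inequalities can be checked numerically in the simplest case. The sketch below assumes $d_X = 1$, ignores the depth limit of the recursion, and uses the weight $2^{-c\|\mathcal{P}\|}$ for a partition with $\|\mathcal{P}\|$ intervals; under these assumptions a recursive dyadic partition with $k$ intervals is a binary tree with $k$ leaves, and there are Catalan-many of them. The function names are hypothetical.

```python
from math import comb

def catalan(j):
    """Number of binary trees with j internal nodes."""
    return comb(2 * j, j) // (j + 1)

def kraft_sum(c, max_leaves):
    """Partial sum of 2^(-c * ||P||) over dyadic trees with up to max_leaves leaves."""
    return sum(catalan(k - 1) * 2.0 ** (-c * k) for k in range(1, max_leaves + 1))

for c in (2.0, 3.0):
    print(c, kraft_sum(c, 200))
```

For $c = 2$ the full series equals $1/2$ by the generating function of the Catalan numbers, so the partial sums stay below that value: a weight proportional to the number of regions is enough to compensate the exponential growth of the number of partitions.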



In those sections, we study the two different choices proposed above for the set $\mathcal{F}$. We
first consider a piecewise polynomial strategy similar to the one proposed by Willett and Nowak [51]
defined for $\mathcal{Y} = [0,1]^{d_Y}$ in which the set $\mathcal{F}$ is a product of sets. We then
consider a Gaussian mixture strategy with varying mixing proportions but common mixture components
that extends the work of Maugis and Michel [39] and has been the original motivation of this work.
In both cases, we prove that the penalty can be chosen roughly proportional to the dimension.



4.2 Piecewise polynomial conditional density estimation 

In this section, we let $\mathcal{X} = [0,1]^{d_X}$, $\mathcal{Y} = [0,1]^{d_Y}$ and $\lambda$ be the
Lebesgue measure $dy$. Note that, in this case, $\lambda$ is a probability measure on $\mathcal{Y}$.
Our candidate density $s(y|x \in \mathcal{R}_l)$ is then chosen among piecewise polynomial densities.
More precisely, we reuse a hyperrectangle partitioning strategy, this time for
$\mathcal{Y} = [0,1]^{d_Y}$, and impose that our candidate conditional density
$s(y|x \in \mathcal{R}_l)$ is a square of polynomial on each hyperrectangle
$\mathcal{R}^{\mathcal{Y}}_{l,k}$ of the partition $\mathcal{Q}_l$. This differs from the
choice of Willett and Nowak [51] in which the candidate density is simply a polynomial. The two
choices coincide however when the polynomial is chosen among the constant ones. Although our
choice of using squares of polynomials is less natural, it already ensures the positivity of the
candidates so that we only have to impose that the integrals of the piecewise polynomials are
equal to 1 to obtain conditional densities. It turns out to be also crucial to obtain a control of
the local bracketing entropy of our models. Note that this setting differs from the one of
Blanchard et al. [11] in which $\mathcal{Y}$ is a finite discrete set.

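We should now define the sets $\mathcal{F}$ we consider for a given partition
$\mathcal{P} = \{\mathcal{R}_l\}_{1 \le l \le \|\mathcal{P}\|}$ of $\mathcal{X} = [0,1]^{d_X}$. Let
$\mathbf{D} = (D_1, \ldots, D_{d_Y})$, we first define for any partition
$\mathcal{Q} = \{\mathcal{R}^{\mathcal{Y}}_k\}_{1 \le k \le \|\mathcal{Q}\|}$ of
$\mathcal{Y} = [0,1]^{d_Y}$ the set $\mathcal{F}_{\mathcal{Q},\mathbf{D}}$ of squares of piecewise
polynomial densities of maximum degree $\mathbf{D}$ defined in the partition $\mathcal{Q}$:

$$\mathcal{F}_{\mathcal{Q},\mathbf{D}} = \left\{ s(y) = \sum_{\mathcal{R}^{\mathcal{Y}}_k \in \mathcal{Q}} P^2_{\mathcal{R}^{\mathcal{Y}}_k}(y)\,\mathbf{1}_{\{y \in \mathcal{R}^{\mathcal{Y}}_k\}} \;\middle|\; \begin{array}{l} \forall \mathcal{R}^{\mathcal{Y}}_k \in \mathcal{Q},\ P_{\mathcal{R}^{\mathcal{Y}}_k} \text{ polynomial of degree at most } \mathbf{D}, \\ \sum_{\mathcal{R}^{\mathcal{Y}}_k \in \mathcal{Q}} \int_{\mathcal{R}^{\mathcal{Y}}_k} P^2_{\mathcal{R}^{\mathcal{Y}}_k}(y)\,dy = 1 \end{array} \right\}.$$

For any partition collection
$\mathcal{Q}^{\mathcal{P}} = (\mathcal{Q}_l)_{1 \le l \le \|\mathcal{P}\|} = \left(\{\mathcal{R}^{\mathcal{Y}}_{l,k}\}_{1 \le k \le \|\mathcal{Q}_l\|}\right)_{1 \le l \le \|\mathcal{P}\|}$
of $\mathcal{Y} = [0,1]^{d_Y}$, we can thus define the set
$\mathcal{F}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}$ of $\|\mathcal{P}\|$-tuples of piecewise polynomial
densities as

$$\mathcal{F}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} = \left\{ (s(\cdot|\mathcal{R}_l))_{\mathcal{R}_l \in \mathcal{P}} \;\middle|\; \forall \mathcal{R}_l \in \mathcal{P},\ s(\cdot|\mathcal{R}_l) \in \mathcal{F}_{\mathcal{Q}_l,\mathbf{D}} \right\}.$$

The model $S_{\mathcal{P},\mathcal{F}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}}$, that is denoted
$S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}$ with a slight abuse of notation, is thus the set

$$\begin{aligned} S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} &= \left\{ s(y|x) = \sum_{\mathcal{R}_l \in \mathcal{P}} s(y|\mathcal{R}_l)\,\mathbf{1}_{\{x \in \mathcal{R}_l\}} \;\middle|\; (s(y|\mathcal{R}_l))_{\mathcal{R}_l \in \mathcal{P}} \in \mathcal{F}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} \right\} \\ &= \left\{ s(y|x) = \sum_{\mathcal{R}_l \in \mathcal{P}} \sum_{\mathcal{R}^{\mathcal{Y}}_{l,k} \in \mathcal{Q}_l} P^2_{\mathcal{R}_l \times \mathcal{R}^{\mathcal{Y}}_{l,k}}(y)\,\mathbf{1}_{\{y \in \mathcal{R}^{\mathcal{Y}}_{l,k}\}}\,\mathbf{1}_{\{x \in \mathcal{R}_l\}} \;\middle|\; \begin{array}{l} \forall \mathcal{R}_l \in \mathcal{P},\ \forall \mathcal{R}^{\mathcal{Y}}_{l,k} \in \mathcal{Q}_l, \\ P_{\mathcal{R}_l \times \mathcal{R}^{\mathcal{Y}}_{l,k}} \text{ polynomial of degree at most } \mathbf{D}, \\ \forall \mathcal{R}_l \in \mathcal{P},\ \sum_{\mathcal{R}^{\mathcal{Y}}_{l,k} \in \mathcal{Q}_l} \int_{\mathcal{R}^{\mathcal{Y}}_{l,k}} P^2_{\mathcal{R}_l \times \mathcal{R}^{\mathcal{Y}}_{l,k}}(y)\,dy = 1 \end{array} \right\}. \end{aligned}$$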



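Denoting $\mathcal{R}^{\times}_{l,k}$ the product $\mathcal{R}_l \times \mathcal{R}^{\mathcal{Y}}_{l,k}$,
the conditional densities of the previous set can be advantageously rewritten as

$$s(y|x) = \sum_{\mathcal{R}_l \in \mathcal{P}} \sum_{\mathcal{R}^{\mathcal{Y}}_{l,k} \in \mathcal{Q}_l} P^2_{\mathcal{R}^{\times}_{l,k}}(y)\,\mathbf{1}_{\{(x,y) \in \mathcal{R}^{\times}_{l,k}\}}.$$

As shown by Willett and Nowak [51], the maximum likelihood estimate in this model can be
obtained by an independent computation on each subset $\mathcal{R}^{\times}_{l,k}$:

$$\hat{P}_{\mathcal{R}^{\times}_{l,k}} = \sqrt{\frac{\sum_{i=1}^n \mathbf{1}_{\{(X_i,Y_i) \in \mathcal{R}^{\times}_{l,k}\}}}{\sum_{i=1}^n \mathbf{1}_{\{X_i \in \mathcal{R}_l\}}}}\ \operatorname*{argmin}_{P,\ \deg(P) \le \mathbf{D},\ \int_{\mathcal{R}^{\mathcal{Y}}_{l,k}} P^2(y)\,dy = 1}\ \sum_{i=1}^n -\mathbf{1}_{\{(X_i,Y_i) \in \mathcal{R}^{\times}_{l,k}\}} \ln\left(P^2(Y_i)\right).$$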

This property is important to be able to use the efficient optimization algorithms of Willett and 
Nowak [51] and Huang et al. [29]. 
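
In the simplest case of degree 0 polynomials, the independent computation on each cell reduces to a conditional histogram: the estimate on a cell is the fraction of the observations of the region falling in the cell, divided by the length of the cell. The sketch below illustrates this for $d_X = d_Y = 1$, assuming for simplicity the same partition of $\mathcal{Y}$ in every region; the function names are hypothetical and this is not the authors' implementation.

```python
from bisect import bisect_right

def cell_index(v, edges):
    """Index of the cell of [0, 1] (with interior cut points `edges`) containing v."""
    return bisect_right(edges, v)

def piecewise_constant_mle(xs, ys, x_edges, y_edges):
    """Return est[l][k], the estimated value of s(y|R_l) for y in the k-th cell of Y."""
    y_bounds = [0.0] + list(y_edges) + [1.0]
    n_l = [0] * (len(x_edges) + 1)
    n_lk = [[0] * (len(y_edges) + 1) for _ in n_l]
    for x, y in zip(xs, ys):
        l, k = cell_index(x, x_edges), cell_index(y, y_edges)
        n_l[l] += 1
        n_lk[l][k] += 1
    est = []
    for l, row in enumerate(n_lk):
        est.append([
            (row[k] / n_l[l]) / (y_bounds[k + 1] - y_bounds[k]) if n_l[l] else 0.0
            for k in range(len(row))
        ])
    return est

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0.1, 0.2, 0.9, 0.6, 0.7, 0.8]
est = piecewise_constant_mle(xs, ys, [0.5], [0.5])
print(est)
```

Because each cell is handled separately, the log-likelihood of a partition is a sum of per-cell terms, which is what the tree-based optimization algorithms rely on.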

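Our model collection is obtained by considering all partitions $\mathcal{P}$ within one of the
UDP($\mathcal{X}$), RDP($\mathcal{X}$), RDSP($\mathcal{X}$), RSP($\mathcal{X}$) or HRP($\mathcal{X}$)
partition collections with respect to $[0,1]^{d_X}$ and, for a fixed $\mathcal{P}$, all partitions
$\mathcal{Q}_l$ within one of the UDP($\mathcal{Y}$), RDP($\mathcal{Y}$), RDSP($\mathcal{Y}$),
RSP($\mathcal{Y}$) or HRP($\mathcal{Y}$) partition collections with respect to $[0,1]^{d_Y}$. By
construction, in all cases,

$$\dim(S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}) = \sum_{\mathcal{R}_l \in \mathcal{P}} \left( \|\mathcal{Q}_l\| \prod_{d=1}^{d_Y} (D_d + 1) - 1 \right).$$

To define the penalty, we use a slight upper bound of this dimension

$$\mathcal{D}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} = \sum_{\mathcal{R}_l \in \mathcal{P}} \|\mathcal{Q}_l\| \prod_{d=1}^{d_Y} (D_d + 1) = \|\mathcal{Q}^{\mathcal{P}}\| \prod_{d=1}^{d_Y} (D_d + 1)$$

where $\|\mathcal{Q}^{\mathcal{P}}\| = \sum_{\mathcal{R}_l \in \mathcal{P}} \|\mathcal{Q}_l\|$ is the total
number of hyperrectangles in all the partitions: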

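Theorem 3. Fix a collection $\star(\mathcal{X})$ among UDP($\mathcal{X}$), RDP($\mathcal{X}$),
RDSP($\mathcal{X}$), RSP($\mathcal{X}$) or HRP($\mathcal{X}$) for $\mathcal{X} = [0,1]^{d_X}$, a
collection $\star(\mathcal{Y})$ among UDP($\mathcal{Y}$), RDP($\mathcal{Y}$), RDSP($\mathcal{Y}$),
RSP($\mathcal{Y}$) or HRP($\mathcal{Y}$) and a maximal degree for the polynomials
$\mathbf{D} \in \mathbb{N}^{d_Y}$.
Let

$$\mathcal{S} = \left\{ S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} \;\middle|\; \mathcal{P} = \{\mathcal{R}_l\} \in \mathcal{S}_{\mathcal{P}}^{\star(\mathcal{X})} \text{ and } \forall \mathcal{R}_l \in \mathcal{P},\ \mathcal{Q}_l \in \mathcal{S}_{\mathcal{P}}^{\star(\mathcal{Y})} \right\}.$$

Then there exist a $C_\star > 0$ and a $c_\star > 0$ independent of $n$, such that for any $\rho$ and
for any $C_1 > 1$, the penalized estimator of Theorem 2 satisfies

$$\mathbb{E}\left[ JKL^{\otimes n}_{\rho,\lambda}\left(s_0, \hat{s}_{\hat{\mathcal{Q}}^{\mathcal{P}},\mathbf{D}}\right) \right] \le C_1 \inf_{S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} \in \mathcal{S}} \left( \inf_{s_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} \in S_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}} KL^{\otimes n}_{\lambda}\left(s_0, s_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}\right) + \frac{\mathrm{pen}(\mathcal{Q}^{\mathcal{P}},\mathbf{D})}{n} \right) + \frac{C_2}{n} + \frac{\eta + \eta'}{n}$$

as soon as

$$\mathrm{pen}(\mathcal{Q}^{\mathcal{P}},\mathbf{D}) \ge \tilde{\kappa}\, \mathcal{D}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}$$

for

$$\tilde{\kappa} > \kappa_0 \left( C_\star + c_\star \left( A_0^{\star(\mathcal{X})} + B_0^{\star(\mathcal{X})} + A_0^{\star(\mathcal{Y})} + B_0^{\star(\mathcal{Y})} \right) + 2 \ln n \right),$$

where $\kappa_0$ and $C_2$ are the constants of Theorem 2 that depend only on $\rho$ and $C_1$.
Furthermore $C_\star \le \frac{1}{2} \ln(8\pi e) + \sum_{d=1}^{d_Y} \ln\left(\sqrt{2}(D_d + 1)\right)$ and
$c_\star \le 2 \ln 2$.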

A penalty chosen proportional to the dimension of the model, the multiplicative factor
$\tilde{\kappa}$ being constant over $n$ up to a logarithmic factor, is thus sufficient to guarantee
the estimator performance. Furthermore, one can use a penalty which is a sum of penalties for each
hyperrectangle of the partition:

$$\mathrm{pen}(\mathcal{Q}^{\mathcal{P}},\mathbf{D}) = \sum_{\mathcal{R}^{\times}_{l,k} \in \mathcal{Q}^{\mathcal{P}}} \tilde{\kappa} \left( \prod_{d=1}^{d_Y} (D_d + 1) \right).$$

This additive structure of the penalty allows the use of the fast partition optimization
algorithms of Donoho [18] and Huang et al. [29] as soon as the partition collection is tree
structured.
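
The role of the additive structure can be sketched as follows: when the penalized criterion is a sum of per-cell costs, the best recursive dyadic partition is obtained by a bottom-up comparison between keeping a cell and splitting it. The code below is a minimal one-dimensional sketch of this dynamic programming idea, not the algorithms of the cited papers; the function name, the `cost` callback and the toy cost are hypothetical.

```python
from fractions import Fraction

def best_dyadic_partition(cost, lo, hi, min_len):
    """Minimize the sum of cell costs over recursive dyadic partitions of [lo, hi).

    cost(lo, hi) should return the penalized contrast of a single cell
    (negative log-likelihood plus per-cell penalty).
    Returns (total_cost, list_of_cells).
    """
    keep = (cost(lo, hi), [(lo, hi)])
    if hi - lo < 2 * min_len:       # children would be too small
        return keep
    mid = (lo + hi) / 2
    left = best_dyadic_partition(cost, lo, mid, min_len)
    right = best_dyadic_partition(cost, mid, hi, min_len)
    split = (left[0] + right[0], left[1] + right[1])
    return split if split[0] < keep[0] else keep

# Toy cost: penalty 1 per cell, plus a misfit of 5 for cells straddling 1/2.
half = Fraction(1, 2)
toy_cost = lambda lo, hi: 1 + (5 if lo < half < hi else 0)
total, cells = best_dyadic_partition(toy_cost, Fraction(0), Fraction(1), Fraction(1, 8))
print(total, cells)
```

Each cell of the dyadic tree is visited once, so the search is linear in the number of tree nodes even though the number of candidate partitions grows exponentially.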
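In Appendix, we obtain a weaker requirement on the penalty

$$\mathrm{pen}(\mathcal{Q}^{\mathcal{P}},\mathbf{D}) \ge \kappa \left( \left( C_\star + 2 \ln \frac{n}{\sqrt{\|\mathcal{Q}^{\mathcal{P}}\|}} \right) \mathcal{D}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}} + c_\star \left( A_0^{\star(\mathcal{X})} + \left( B_0^{\star(\mathcal{X})} + A_0^{\star(\mathcal{Y})} \right) \|\mathcal{P}\| + B_0^{\star(\mathcal{Y})} \|\mathcal{Q}^{\mathcal{P}}\| \right) \right)$$

in which the complexity part and the coding part appear more explicitly. This smaller penalty is
no longer proportional to the dimension but still sufficient to guarantee the estimator
performance. Using the crude bound $\|\mathcal{Q}^{\mathcal{P}}\| \ge 1$, one sees that such a penalty
can still be upper bounded by a sum of penalties over each hyperrectangle. The loss with respect to
the original penalty is of order $\kappa \log \|\mathcal{Q}^{\mathcal{P}}\|\, \mathcal{D}_{\mathcal{Q}^{\mathcal{P}},\mathbf{D}}$,
which is negligible as long as the number of hyperrectangles remains small with respect to $n^2$.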

Some variations around this Theorem can be obtained through simple modifications of its
proof as explained in Appendix. For example, the term $2 \ln(n/\sqrt{\|\mathcal{Q}^{\mathcal{P}}\|})$
disappears if $\mathcal{P}$ belongs to $\mathcal{S}_{\mathcal{P}}^{\mathrm{UDP}(\mathcal{X})}$ while
$\mathcal{Q}_l$ is independent of $\mathcal{R}_l$ and belongs to
$\mathcal{S}_{\mathcal{P}}^{\mathrm{UDP}(\mathcal{Y})}$. Choosing the degrees $\mathbf{D}$ of
the polynomials among a family $\mathcal{D}^M$ either globally or locally as proposed by Willett
and Nowak [51] is also possible. The constant $C_\star$ is replaced by its maximum over the family
considered, while the coding part is modified by replacing respectively $A_0^{\star(\mathcal{X})}$ by
$A_0^{\star(\mathcal{X})} + \ln |\mathcal{D}^M|$ for a global optimization and
$B_0^{\star(\mathcal{Y})}$ by $B_0^{\star(\mathcal{Y})} + \ln |\mathcal{D}^M|$ for the local optimization.
Such a penalty can be further modified into an additive one with only minor loss. Note that even if
the family and its maximal degree grow with $n$, the constant $C_\star$ grows at a logarithmic rate
in $n$ as long as the maximal degree grows at most polynomially with $n$.

Finally, if we assume that the true conditional density is lower bounded, then

$$KL^{\otimes n}_{\lambda}(s,t) \le \left\| \frac{1}{t} \right\|_{\infty} \left( \|s - t\|^{\otimes n}_{\lambda,2} \right)^2$$

as shown by Kolaczyk and Nowak [33]. We can thus reuse ideas from Willett and Nowak [51],
Akakpo [1] or Akakpo and Lacour [2] to infer the quasi optimal minimaxity of this estimator
for anisotropic Besov spaces (see for instance Karaivanov and Petrushev [32] for a definition)
whose regularity indices are smaller than 1 along the axes of $\mathcal{X}$ and smaller than $D + 1$
along the axes of $\mathcal{Y}$.



4.3 Spatial Gaussian mixtures, models, bracketing entropy and penalties

In this section, we consider an extension of Gaussian mixtures that takes the covariate into
account through the mixing proportions. This model has been motivated by the unsupervised
hyperspectral image segmentation problem mentioned in the introduction. We first recall some basic
facts about Gaussian mixtures and their use in unsupervised classification.

In a classical Gaussian mixture model, the observations are assumed to be drawn from several
different classes, each class having a Gaussian law. Let $K$ be the number of different Gaussians,
often called the number of clusters, the density $s_0$ of $Y_i$ with respect to the Lebesgue measure
is thus modeled as

$$s_{K,\theta,\pi}(\cdot) = \sum_{k=1}^{K} \pi_k \Phi_{\theta_k}(\cdot)$$

where

$$\Phi_{\theta_k}(y) = \frac{1}{(2\pi)^{p/2} (\det \Sigma_k)^{1/2}}\, e^{-\frac{1}{2}(y - \mu_k)' \Sigma_k^{-1} (y - \mu_k)}$$

with $\mu_k$ the mean of the $k$th component, $\Sigma_k$ its covariance matrix,
$\theta_k = (\mu_k, \Sigma_k)$ and $\pi_k$ its mixing proportion. A model $S_{K,\mathcal{G}}$ is
obtained by specifying the number of components $K$ as well as a set $\mathcal{G}$ to which should
belong the $K$-tuple of Gaussians $(\Phi_{\theta_1}, \ldots, \Phi_{\theta_K})$. Those Gaussians
can share for instance the same shape, the same volume or the same diagonalization basis. The
classical choices are described for instance in Biernacki et al. [7]. Using the EM algorithm, or
one of its extensions, one can efficiently obtain the proportions $\hat{\pi}_k$ and the Gaussian
parameters $\hat{\theta}_k$ of the maximum likelihood estimate within such a model. Using tools also
derived from Massart [38], Maugis and Michel [39] show how to choose the number of classes by a
penalized maximum likelihood principle. These Gaussian mixture models are often used in
unsupervised classification applications: one observes a collection of $Y_i$ and tries to split them
into homogeneous classes. Those classes are chosen as the Gaussian components of an estimated
Gaussian mixture close to the density of the observations. Each observation can then be assigned to
a class by a simple maximum likelihood principle:

$$\hat{k}(y) = \operatorname*{argmax}_{1 \le k \le K} \hat{\pi}_k \Phi_{\hat{\theta}_k}(y).$$

This methodology can be applied directly to a hyperspectral image and yields a segmentation
method, often called the spectral method in the image processing community. This method however
fails to exploit the spatial organization of the pixels.
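
The assignment rule above is easy to state in code. The following sketch uses one-dimensional Gaussian components to stay within the standard library; the function names and the numerical values are hypothetical and only serve the illustration.

```python
import math

def gauss_pdf(y, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def assign_class(y, weights, components):
    """Return the index k maximizing weights[k] * Phi_k(y)."""
    scores = [w * gauss_pdf(y, mu, var) for w, (mu, var) in zip(weights, components)]
    return max(range(len(scores)), key=lambda k: scores[k])

components = [(0.0, 1.0), (3.0, 1.0)]    # (mean, variance) of each class
print(assign_class(0.1, [0.5, 0.5], components))
print(assign_class(2.9, [0.5, 0.5], components))
print(assign_class(1.6, [0.9, 0.1], components))
```

The last call shows the effect of the mixing proportions: an observation slightly closer to the second mean is still assigned to the first class when that class is much more frequent. The position of the observation plays no role here, which is the limitation addressed next.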

To overcome this issue, Kolaczyk et al. [34] and Antoniadis et al. [3] propose to use mixture
models in which the mixing proportions depend on the covariate $X_i$ while the mixture components
remain constant. We propose to estimate simultaneously those mixing proportions and the
mixture components with our partition-based strategy. In a semantic analysis context, in which
documents replace pixels, a similar Gaussian mixture with varying weights, but without the
partition structure, has been proposed by Si and Jin [43] as an extension of a general mixture
based semantic analysis model introduced by Hofmann [28] under the name Probabilistic Latent
Semantic Analysis. A similar model has also been considered in the work of Young and Hunter
[52]. In our approach, for a given partition $\mathcal{P}$, the conditional densities $s(\cdot|x)$ are
modeled as

$$s_{\mathcal{P},K,\theta,\pi}(y|x) = \sum_{\mathcal{R}_l \in \mathcal{P}} \left( \sum_{k=1}^{K} \pi_k[\mathcal{R}_l]\, \Phi_{\theta_k}(y) \right) \mathbf{1}_{\{x \in \mathcal{R}_l\}}.$$

The $K$-tuples of Gaussians can be chosen in the same way as in the classical Gaussian mixture case.
Using a penalized maximum likelihood strategy, a partition $\hat{\mathcal{P}}$, a number of Gaussian
components $\hat{K}$, their parameters $\hat{\theta}_k$ and all the mixing proportions
$\hat{\pi}[\hat{\mathcal{R}}_l]$ can be estimated. Each pair of pixel position and spectrum $(x, y)$
can then be assigned to one of the estimated mixture components by a maximum likelihood principle:

$$\hat{k}(x,y) = \operatorname*{argmax}_{1 \le k \le \hat{K}} \hat{\pi}_k[\hat{\mathcal{R}}_l]\, \Phi_{\hat{\theta}_k}(y) \quad \text{for } x \in \hat{\mathcal{R}}_l.$$

This is the strategy we have used at IPANEMA [6] to segment, in an unsupervised manner,
hyperspectral images. In these images, a spectrum $Y_i$, with around 1000 frequency bands, is
measured at each pixel location $X_i$ and our aim was to derive a partition in homogeneous regions
without any human intervention. This is a precious help for users of this imaging technique
as it allows them to focus the study on a few representative spectra. Combining the classical
EM strategy for the Gaussian parameter estimation (see for instance Biernacki et al. [7]) and
dynamic programming strategies for the partition, as described for instance by Kolaczyk et al.
[34], we have been able to implement this penalized estimator and to test it on real datasets.
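
To give an idea of how region-dependent proportions enter the EM iterations, the sketch below updates only the mixing proportions, for a fixed partition and fixed one-dimensional Gaussian components. It is a simplified illustration under these assumptions, not the implementation used for the experiments; the function names and data are hypothetical.

```python
import math

def gauss_pdf(y, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update_proportions(regions, ys, components, weights, n_iter=50):
    """EM iterations on region-wise mixing proportions, components held fixed.

    regions[i]: index l of the region containing X_i.
    components: list of (mean, variance) pairs shared by all regions.
    weights[l][k]: initial proportion of component k in region l.
    """
    n_regions, K = len(weights), len(components)
    for _ in range(n_iter):
        resp_sum = [[0.0] * K for _ in range(n_regions)]
        counts = [0] * n_regions
        for l, y in zip(regions, ys):
            lik = [weights[l][k] * gauss_pdf(y, *components[k]) for k in range(K)]
            total = sum(lik)
            for k in range(K):
                resp_sum[l][k] += lik[k] / total   # E-step: posterior of class k
            counts[l] += 1
        weights = [
            [s / counts[l] for s in resp_sum[l]] if counts[l] else weights[l]
            for l in range(n_regions)
        ]                                           # M-step: average posteriors
    return weights

regions = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0.1, -0.2, 0.0, 0.3, 5.1, 4.8, 5.0, 5.2]
w = update_proportions(regions, ys, [(0.0, 1.0), (5.0, 1.0)], [[0.5, 0.5], [0.5, 0.5]])
print(w)
```

In a full procedure this step would alternate with the update of the Gaussian parameters and with the search of the partition.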

Figure 2 illustrates this methodology. The studied sample is a thin cross-section of maple
with a single layer of hide glue on top of it, prepared recently using materials and processes from
the Cité de la Musique, with materials of the same type and quality as those used for lutherie.
This sample is to serve as reference material to study the spectral variation of the hide glue
at the various steps of the process. We present here the result for a low signal to noise ratio
acquisition requiring only two minutes of scan. Using piecewise constant mixing proportions
instead of constant mixing proportions leads to a better geometry of the segmentation, with fewer
isolated points and more structured boundaries. As described in a more applied study [16], this
methodology makes it possible to work with a much lower signal to noise ratio and thus to reduce
significantly the acquisition time.

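We should now specify the models we consider. As we follow the construction of Section 4.1,
for a given segmentation $\mathcal{P}$, this amounts to specifying the set $\mathcal{F}$ to which belong
the $\|\mathcal{P}\|$-tuples of densities $(s(y|\mathcal{R}_l))_{\mathcal{R}_l \in \mathcal{P}}$. As
described above, we assume that
$s(y|\mathcal{R}_l) = \sum_{k=1}^{K} \pi_k[\mathcal{R}_l] \Phi_{\theta_k}(y)$.
The mixing proportions within the region $\mathcal{R}_l$, $\pi[\mathcal{R}_l]$, are chosen freely among
all vectors of the $K - 1$ dimensional simplex $\mathcal{S}_{K-1}$. Denoting
$\pi[\mathcal{R}(x)] = \sum_{\mathcal{R}_l \in \mathcal{P}} \pi[\mathcal{R}_l]\, \mathbf{1}_{\{x \in \mathcal{R}_l\}}$,
the conditional density can advantageously be rewritten

$$s(y|x) = \sum_{k=1}^{K} \pi_k[\mathcal{R}(x)]\, \Phi_{\theta_k}(y),$$

and the class assignment rule becomes

$$\hat{k}(x,y) = \operatorname*{argmax}_{1 \le k \le \hat{K}} \hat{\pi}_k[\hat{\mathcal{R}}(x)]\, \Phi_{\hat{\theta}_k}(y).$$

As we assume the mixture components are the same in each region, for a given number of
components $K$, the set $\mathcal{F}$ is entirely specified by the set $\mathcal{G}$ of $K$-tuples of
Gaussians $(\Phi_{\theta_1}, \ldots, \Phi_{\theta_K})$ (or equivalently by a set $\Theta$ for
$\theta = (\theta_1, \ldots, \theta_K)$).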

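To allow variable selection, we follow Maugis and Michel [39] and let $E$ be an arbitrary
subspace of $\mathcal{Y} = \mathbb{R}^p$, that is expressed differently for the different classes, and
let $E^{\perp}$ be its orthogonal, in which all classes behave similarly. We assume thus that

$$\Phi_{\theta_k}(y) = \Phi_{\theta_{E,k}}(y_E)\, \Phi_{\theta_{E^{\perp}}}(y_{E^{\perp}})$$

where $y_E$ and $y_{E^{\perp}}$ denote, respectively, the projection of $y$ on $E$ and $E^{\perp}$,
$\Phi_{\theta_{E,k}}$ is a Gaussian whose parameters depend on $k$ while $\Phi_{\theta_{E^{\perp}}}$ is
independent of $k$. A model is then specified by the choice of a set $\mathcal{G}_E^K$ for the
$K$-tuples $(\Phi_{\theta_{E,1}}, \ldots, \Phi_{\theta_{E,K}})$ (or equivalently a set $\Theta_E^K$ for
the $K$-tuples of parameters $(\theta_{E,1}, \ldots, \theta_{E,K})$) and a set
$\mathcal{G}_{E^{\perp}}$ for the Gaussian $\Phi_{\theta_{E^{\perp}}}$ (or equivalently a set
$\Theta_{E^{\perp}}$ for its parameter $\theta_{E^{\perp}}$). The resulting model is denoted
$S_{\mathcal{P},K,\mathcal{G}}$:

$$S_{\mathcal{P},K,\mathcal{G}} = \left\{ s_{\mathcal{P},K,\theta,\pi}(y|x) = \sum_{k=1}^{K} \pi_k[\mathcal{R}(x)]\, \Phi_{\theta_{E,k}}(y_E)\, \Phi_{\theta_{E^{\perp}}}(y_{E^{\perp}}) \;\middle|\; \begin{array}{l} (\Phi_{\theta_{E,1}}, \ldots, \Phi_{\theta_{E,K}}) \in \mathcal{G}_E^K, \\ \Phi_{\theta_{E^{\perp}}} \in \mathcal{G}_{E^{\perp}}, \\ \forall \mathcal{R}_l \in \mathcal{P},\ \pi[\mathcal{R}_l] \in \mathcal{S}_{K-1} \end{array} \right\}.$$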



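The sets $\mathcal{G}_E^K$ and $\mathcal{G}_{E^{\perp}}$ are chosen among the classical Gaussian
$K$-tuples, as described for instance in Biernacki et al. [7]. For a space $E$ of dimension $p_E$
and a fixed number $K$ of classes, we specify the set

$$\mathcal{G}_{[\cdot]^K} = \left\{ (\Phi_{\theta_{E,1}}, \ldots, \Phi_{\theta_{E,K}}) \;\middle|\; \theta = (\theta_{E,1}, \ldots, \theta_{E,K}) \in \Theta_{[\cdot]^K} \right\}$$

through a parameter set $\Theta_{[\cdot]^K}$ defined by some (mild) constraints on the means $\mu_k$
and some (strong) constraints on the covariance matrices $\Sigma_k$.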

The $K$-tuple of means $\mu = (\mu_1, \ldots, \mu_K)$ is either known or unknown without any
restriction. A stronger structure is imposed on the $K$-tuple of covariance matrices
$(\Sigma_1, \ldots, \Sigma_K)$. To define it, we need to introduce a decomposition of any covariance
matrix $\Sigma$ into $L D A D'$ where, denoting $|\Sigma|$ the determinant of $\Sigma$,
$L = |\Sigma|^{1/p_E}$ is a positive scalar corresponding to the volume, $D$ is the matrix of
eigenvectors of $\Sigma$ and $A$ the diagonal matrix of renormalized eigenvalues of $\Sigma$
(the eigenvalues of $|\Sigma|^{-1/p_E} \Sigma$). Note that this decomposition is not unique as, for
example, $D$ and $A$ are defined up to a permutation. We impose nevertheless a structure on the
$K$-tuple $(\Sigma_1, \ldots, \Sigma_K)$ through structures on the corresponding $K$-tuples
$(L_1, \ldots, L_K)$, $(D_1, \ldots, D_K)$ and $(A_1, \ldots, A_K)$. They are either known, unknown
but with a common value or unknown without any restriction. The corresponding set is indexed by
$[\mu_\star\, L_\star\, D_\star\, A_\star]^K_{p_E}$ where $\star = 0$ means that the quantity is known,
$\star = K$ that the quantity is unknown without any restriction and possibly different for every
class, and its lack means that there is a common unknown value over
all classes.
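
The decomposition $\Sigma = L D A D'$ can be illustrated in dimension 2, where the eigen-decomposition has a closed form. The sketch below is restricted to $2 \times 2$ covariance matrices to avoid any dependency; the function names are hypothetical.

```python
import math

def volume_shape_orientation(a, b, d):
    """Decompose the 2x2 covariance [[a, b], [b, d]] as L * D * A * D'."""
    det = a * d - b * b
    L = math.sqrt(det)                       # |Sigma|^(1/p) with p = 2
    mean, radius = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    eigvals = (mean + radius, mean - radius)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # angle of the leading eigenvector
    D = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    A = (eigvals[0] / L, eigvals[1] / L)      # diagonal of A, product equals 1
    return L, D, A

def recompose(L, D, A):
    """Rebuild Sigma = L * D * diag(A) * D'."""
    return [[L * sum(D[i][k] * A[k] * D[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

L, D, A = volume_shape_orientation(2.0, 0.5, 1.0)
print(L, A, recompose(L, D, A))
```

Fixing some of the three factors, or sharing them across classes, is what produces the different families of constrained covariance matrices described above.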

To have a set with finite bracketing entropy, we further restrict the values of the means
$\mu_k$, the volumes $L_k$ and the renormalized eigenvalue matrices $A_k$. The means are assumed to
satisfy $\forall 1 \le k \le K$, $|\mu_k| \le a$ for a known $a$ while the volumes satisfy
$\forall 1 \le k \le K$, $L_- \le L_k \le L_+$ for some known positive values $L_-$ and $L_+$. To
describe the constraints on the renormalized eigenvalue matrices $A_k$, we define the set
$\mathcal{A}(\lambda_-, \lambda_+, p_E)$ of diagonal matrices $A$ such that $|A| = 1$
and $\forall 1 \le i \le p_E$, $\lambda_- \le A_{i,i} \le \lambda_+$. Our assumption is that all the
$A_k$ belong to $\mathcal{A}(\lambda_-, \lambda_+, p_E)$ for some known values $\lambda_-$ and
$\lambda_+$.

Among the $3^4 = 81$ such possible sets, six of them have been already studied by Maugis and
Michel [39, 41] in their classical Gaussian mixture model analysis:

• $[\mu_0\, L_K\, D_0\, A_0]^K_{p_E}$ in which only the volume of the variance of a class is unknown.
They use this model with a single class to model the non discriminant variables in $E^{\perp}$.

• $[\mu_K\, L_K\, D_0\, A_K]^K_{p_E}$ in which one assumes that the unknown variances $\Sigma_k$ can be
diagonalized in the same known basis $D_0$.

• $[\mu_K\, L_K\, D_K\, A_K]^K_{p_E}$ in which everything is free.

• $[\mu_K\, L\, D_0\, A]^K_{p_E}$ in which the variances $\Sigma_k$ are assumed to be equal and
diagonalized in the known basis $D_0$.

• $[\mu_K\, L\, D_0\, A_K]^K_{p_E}$ in which the volumes $L_k$ are assumed to be equal and the
variances can be diagonalized in the known basis $D_0$.

• $[\mu_K\, L\, D\, A]^K_{p_E}$ in which the variances $\Sigma_k$ are only assumed to be equal.

All these cases, as well as the others, are covered by our analysis with a single proof.

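To summarize, our models $S_{\mathcal{P},K,\mathcal{G}}$ are parametrized by a partition $\mathcal{P}$,
a number of components $K$, a set $\mathcal{G}$ of $K$-tuples of Gaussians specified by a space $E$
and two parameter sets, a set $\Theta_{[\mu_\star L_\star D_\star A_\star]^K_{p_E}}$ of $K$-tuples of
Gaussian parameters for the differentiated space $E$ and a set
$\Theta_{[\mu_\star L_\star D_\star A_\star]_{p_{E^{\perp}}}}$ of Gaussian parameters for its orthogonal
$E^{\perp}$. Those two sets are chosen among the ones described above with the same constants $a$,
$L_-$, $L_+$, $\lambda_-$ and $\lambda_+$. One verifies that

$$\dim(S_{\mathcal{P},K,\mathcal{G}}) = \|\mathcal{P}\| (K - 1) + \dim\left( \Theta_{[\mu_\star L_\star D_\star A_\star]^K_{p_E}} \right) + \dim\left( \Theta_{[\mu_\star L_\star D_\star A_\star]_{p_{E^{\perp}}}} \right).$$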
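
The dimension formula can be turned into a small counting routine. The sketch below assumes the usual parameter counts for each block in dimension $p$ (namely $p$ for a mean, 1 for a volume, $p(p-1)/2$ for an eigenvector basis and $p-1$ for the normalized eigenvalues) and a multiplicity of 0, 1 or $K$ depending on whether the quantity is known, common to all classes or free. The function names and flag values are hypothetical.

```python
def gaussian_set_dim(p, K, mu, L, D, A):
    """Dimension of a parameter set [mu_* L_* D_* A_*] in dimension p.

    Each flag is "known", "common" or "free" (one value per class).
    """
    mult = {"known": 0, "common": 1, "free": K}
    return (mult[mu] * p                    # means
            + mult[L] * 1                   # volumes
            + mult[D] * p * (p - 1) // 2    # eigenvector bases
            + mult[A] * (p - 1))            # normalized eigenvalues

def model_dim(n_regions, K, dim_E, dim_E_perp):
    """dim(S_{P,K,G}) = ||P|| (K - 1) + dim(Theta_E) + dim(Theta_{E-perp})."""
    return n_regions * (K - 1) + dim_E + dim_E_perp

dim_E = gaussian_set_dim(2, 3, "free", "free", "free", "free")
print(dim_E, model_dim(4, 3, dim_E, 0))
```

For the fully free family the count reduces to $K\,(p + p(p+1)/2)$, the number of parameters of $K$ unconstrained Gaussians.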

Before stating a model selection theorem, we should specify the collections $\mathcal{S}$
considered. We consider sets of models $S_{\mathcal{P},K,\mathcal{G}}$ with $\mathcal{P}$ chosen among
one of the partition collections $\mathcal{S}_{\mathcal{P}}^{\star}$, $K$ smaller than $K_M$, which can
be theoretically chosen equal to $+\infty$, a space $E$ chosen as $\mathrm{span}\{e_i\}_{i \in I}$
where $(e_i)$ is the canonical basis of $\mathbb{R}^p$ and $I$, a subset of $\{1, \ldots, p\}$, is
either known, equal to $\{1, \ldots, p_E\}$ or free, and the indices
$[\mu_\star\, L_\star\, D_\star\, A_\star]$ of $\Theta_E$ and $\Theta_{E^{\perp}}$ are chosen freely among
a subset of the possible combinations.

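Without any assumptions on the design, we obtain

Theorem 4. Assume the collection $\mathcal{S}$ is one of the collections of the previous paragraph.

Then, there exist a $C_\star > \pi$ and a $c_\star > 0$, such that, for any $\rho$ and for any
$C_1 > 1$, the penalized estimator of Theorem 2 satisfies

$$\mathbb{E}\left[ JKL^{\otimes n}_{\rho,\lambda}\left(s_0, \hat{s}_{\hat{\mathcal{P}},\hat{K},\hat{\mathcal{G}}}\right) \right] \le C_1 \inf_{S_{\mathcal{P},K,\mathcal{G}} \in \mathcal{S}} \left( \inf_{s_{\mathcal{P},K,\mathcal{G}} \in S_{\mathcal{P},K,\mathcal{G}}} KL^{\otimes n}_{\lambda}\left(s_0, s_{\mathcal{P},K,\mathcal{G}}\right) + \frac{\mathrm{pen}(\mathcal{P},K,\mathcal{G})}{n} \right) + \frac{C_2}{n} + \frac{\eta + \eta'}{n}$$

as soon as

$$\mathrm{pen}(\mathcal{P},K,\mathcal{G}) \ge \tilde{\kappa}_1 \dim(S_{\mathcal{P},K,\mathcal{G}}) + \tilde{\kappa}_2 \mathcal{D}_E$$

for

$$\tilde{\kappa}_1 \ge \kappa \left( 2 C_\star + 1 + \left( \ln \frac{n}{e\, C_\star \dim(S_{\mathcal{P},K,\mathcal{G}})} \right)_+ \right) \quad \text{and} \quad \tilde{\kappa}_2 \ge \kappa\, c_\star$$

with $\kappa > \kappa_0$ where $\kappa_0$ and $C_2$ are the constants of Theorem 2 that depend only on
$\rho$ and $C_1$ and

$$\mathcal{D}_E = \begin{cases} 0 & \text{if } E \text{ is known,} \\ p_E & \text{if } E \text{ is chosen among spaces spanned by the first coordinates,} \\ \left(1 + \ln 2 + \ln \frac{p}{p_E}\right) p_E & \text{if } E \text{ is free.} \end{cases}$$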

As in the previous section, the penalty term can thus be chosen, up to the variable selection
term $\mathcal{D}_E$, proportional to the dimension of the model, with a proportionality factor
constant up to a logarithmic term in $n$. A penalty proportional to the dimension of the model is
thus sufficient to ensure that the model selected performs almost as well as the best possible
model in terms of conditional density estimation. As in the proof of Antoniadis et al. [3], we can
also obtain that our proposed estimator yields a minimax estimate for spatial Gaussian mixtures
with mixture proportions having a geometrical regularity even without knowing the number of classes.

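Moreover, again as in the previous section, the penalty can have an additive structure: it can
be chosen as a sum of penalties over each hyperrectangle plus one corresponding to $K$ and the
set $\mathcal{G}$. Indeed

$$\mathrm{pen}(\mathcal{P},K,\mathcal{G}) = \sum_{\mathcal{R}_l \in \mathcal{P}} \tilde{\kappa}_1 (K - 1) + \tilde{\kappa}_1 \left( \dim\left( \Theta_{[\mu_\star L_\star D_\star A_\star]^K_{p_E}} \right) + \dim\left( \Theta_{[\mu_\star L_\star D_\star A_\star]_{p_{E^{\perp}}}} \right) \right) + \tilde{\kappa}_2 \mathcal{D}_E$$

satisfies the requirement of Theorem 4. This structure is the key for our numerical minimization
algorithm in which one optimizes alternately the Gaussian parameters with an EM algorithm
and the partition with the same fast optimization strategy as in the previous section.
In Appendix, we obtain a weaker requirement

$$\mathrm{pen}(\mathcal{P},K,\mathcal{G}) \ge \kappa \left( \left( 2 C_\star + 1 + \left( \ln \frac{n}{e\, C_\star \dim(S_{\mathcal{P},K,\mathcal{G}})} \right)_+ \right) \dim(S_{\mathcal{P},K,\mathcal{G}}) + c_\star \left( A_0^{\star(\mathcal{X})} + B_0^{\star(\mathcal{X})} \|\mathcal{P}\| + (K - 1) + \mathcal{D}_E \right) \right)$$

in which the complexity and the coding terms are more explicit. Again up to a logarithmic term
in $\dim(S_{\mathcal{P},K,\mathcal{G}})$, this requirement can be satisfied by a penalty having the same
additive structure as in the previous paragraph.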
Our theoretical result on the conditional density estimation does not guarantee good
segmentation performance. If data are generated according to a Gaussian mixture with varying
mixing proportions, one could nevertheless obtain the asymptotic convergence of our class
estimator to the optimal Bayes one. We have nevertheless observed in our numerical experiments
at IPANEMA that the proposed methodology allows working with a lower signal to noise ratio while
keeping meaningful segmentations.

Two major questions remain nevertheless open. Can we calibrate the penalty (choosing the
constants) in a data-driven way while guaranteeing the theoretical performance in this specific
setting? Can we derive a non asymptotic classification result from this conditional density
result? The slope heuristic proposed by Birgé and Massart [10], which we have used in our
numerical experiments, seems a promising direction. Deriving a theoretical justification in this
conditional estimation setting would be much better. Linking the non asymptotic estimation behavior
to a non asymptotic classification behavior appears even more challenging.
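
For readers unfamiliar with the slope heuristic, the sketch below shows one simplified variant of it: the negative log-likelihood of the most complex models is assumed to decrease linearly with the dimension, the slope of this decrease is estimated by least squares, and the penalty is taken as twice the slope times the dimension. This is an illustrative assumption-laden sketch, not the calibration procedure used in the experiments; the function name, the fraction of models used for the fit and the synthetic values are hypothetical.

```python
def slope_heuristic(dims, neg_logliks, fit_fraction=0.5):
    """Select a model with the penalty 2 * slope * dim.

    The slope is estimated by least squares on the most complex models,
    where the negative log-likelihood is assumed to decrease linearly.
    """
    order = sorted(range(len(dims)), key=lambda i: dims[i])
    tail = order[int(len(order) * (1.0 - fit_fraction)):]
    mean_d = sum(dims[i] for i in tail) / len(tail)
    mean_l = sum(neg_logliks[i] for i in tail) / len(tail)
    cov = sum((dims[i] - mean_d) * (neg_logliks[i] - mean_l) for i in tail)
    var = sum((dims[i] - mean_d) ** 2 for i in tail)
    slope = -cov / var                      # minimal-penalty slope estimate
    crit = [neg_logliks[i] + 2.0 * slope * dims[i] for i in range(len(dims))]
    best = min(range(len(dims)), key=lambda i: crit[i])
    return slope, dims[best]

# Synthetic values: a bias term vanishing at dimension 5, then a linear decrease.
dims = list(range(1, 21))
nll = [50.0 * max(0, 5 - d) + 100.0 - 1.0 * d for d in dims]
print(slope_heuristic(dims, nll))
```

On these synthetic values the estimated slope recovers the linear decrease and the selected dimension is the one at which the bias term vanishes.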



4.4 Bracketing entropy of Gaussian families 

A key ingredient in the proof of Theorem 4 is a generalization of a result of Maugis and Michel
[39, 40] controlling the bracketing entropy of the Gaussian families $\mathcal{G}_{[\cdot]^K}$ with
respect to the $d^{\max}$ distance defined by

$$d^{2\max}\left( (s_1, \ldots, s_K), (t_1, \ldots, t_K) \right) = \sup_{1 \le k \le K} d^2(s_k, t_k).$$

Here, $[(t_1^-, \ldots, t_K^-), (t_1^+, \ldots, t_K^+)]$ is a bracket containing $(s_1, \ldots, s_K)$ if

$$\forall 1 \le k \le K,\ \forall y \in E,\quad t_k^-(y) \le s_k(y) \le t_k^+(y).$$

As it can be of interest on its own, we state it here:
Proposition 4. For any 5 G (0, 



where 



and 



V[^,,L.,D*,A,]f = C^*V^,pE-hCL,Vz,,pE+CD,V£),p^-hCA.VA,pE With < 



C^o = CLo = CDo = CAo = 
C^,^ = Cl^ = CD,^ =CAk=K 
^ = Cl = CD = CA = 1 



T) — Pe(pe-I) 

.I'a.pb =Pe-'^ 



and < 



Vd.pe 

^A,PE 



PE In 1 + 108- 



=Pe 



In (l + 391n(^)pB) 



_ Pe(p 



{PE-I) (ln(2 + 255^1n(^)p^; 



)) 



where cs is a universal constant.
A Proofs for Section 2 (Single model maximum likelihood estimate)

A.1 Proof of Proposition 1

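Proof of Proposition 1. We first notice that, by convexity of the Kullback-Leibler divergence,

$$JKL_{\rho,\lambda}(s,t) = \frac{1}{\rho} KL_{\lambda}\left(s, (1-\rho)s + \rho t\right) \le \frac{1}{\rho} \left( (1-\rho) KL_{\lambda}(s,s) + \rho KL_{\lambda}(s,t) \right) = KL_{\lambda}(s,t).$$

Then let $d\lambda' = ((1-\rho)s + \rho t)\,d\lambda$, the function
$u = \frac{s - t}{(1-\rho)s + \rho t}$ remains in $[-1/\rho, 1/(1-\rho)]$ and is such that
$\frac{s\,d\lambda}{d\lambda'} = 1 + \rho u$ and $\frac{t\,d\lambda}{d\lambda'} = 1 - (1-\rho)u$.

Now,

$$\begin{aligned} JKL_{\rho}(s\,d\lambda, t\,d\lambda) &= \frac{1}{\rho} KL\left(s\,d\lambda, ((1-\rho)s + \rho t)\,d\lambda\right) = \frac{1}{\rho} KL\left((1+\rho u)\,d\lambda', d\lambda'\right) \\ &= \frac{1}{\rho} KL_{\lambda'}(1 + \rho u, 1) = \frac{1}{\rho} \int (1+\rho u) \ln(1+\rho u)\,d\lambda' \end{aligned}$$

and as $\int u\,d\lambda' = 0$

$$= \frac{1}{\rho} \int \left( (1+\rho u) \ln(1+\rho u) - \rho u \right) d\lambda'.$$

Similarly,

$$\begin{aligned} d^2(s\,d\lambda, t\,d\lambda) &= d^2\left((1+\rho u)\,d\lambda', (1-(1-\rho)u)\,d\lambda'\right) = d^2_{\lambda'}\left(1+\rho u, 1-(1-\rho)u\right) \\ &= 2 - 2\int \sqrt{1+\rho u}\,\sqrt{1-(1-\rho)u}\,d\lambda' = 2 \int \left( 1 - \sqrt{1 + (2\rho-1)u - \rho(1-\rho)u^2} \right) d\lambda' \\ &= 2 \int \left( 1 - \sqrt{1 + (2\rho-1)u - \rho(1-\rho)u^2} + \left(\rho - \tfrac{1}{2}\right)u \right) d\lambda'. \end{aligned}$$

Now let $\phi(x) = (1+x)\ln(1+x) - x$, one can verify that $\phi(x)/x^2$ is non-increasing on
$[-1, +\infty)$, so that $\forall u \in [-1/\rho, 1/(1-\rho)]$,
$\phi(\rho u) = \frac{\phi(\rho u)}{\rho^2 u^2} \rho^2 u^2 \ge \frac{\phi(\rho/(1-\rho))}{\rho^2/(1-\rho)^2} \rho^2 u^2$
so that

$$\begin{aligned} (1+\rho u)\ln(1+\rho u) - \rho u &\ge \left( \left(1 + \frac{\rho}{1-\rho}\right) \ln\left(1 + \frac{\rho}{1-\rho}\right) - \frac{\rho}{1-\rho} \right) (1-\rho)^2 u^2 \\ &\ge (1-\rho) \left( \ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho \right) u^2. \end{aligned}$$

Along the same lines, one can verify that $\forall u \in [-1/\rho, 1/(1-\rho)]$

$$1 - \sqrt{1 + (2\rho-1)u - \rho(1-\rho)u^2} + \left(\rho - \tfrac{1}{2}\right)u \le \frac{\max(\rho, 1-\rho)}{2}\, u^2.$$

This implies thus

$$\begin{aligned} \frac{1}{\rho} \left( (1+\rho u)\ln(1+\rho u) - \rho u \right) &\ge \frac{1-\rho}{\rho} \left( \ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho \right) u^2 \\ &\ge \frac{1-\rho}{\rho \max(\rho, 1-\rho)} \left( \ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho \right) 2 \left( 1 - \sqrt{1 + (2\rho-1)u - \rho(1-\rho)u^2} + \left(\rho - \tfrac{1}{2}\right)u \right) \\ &\ge \frac{1}{\rho} \min\left(\frac{1-\rho}{\rho}, 1\right) \left( \ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho \right) 2 \left( 1 - \sqrt{1 + (2\rho-1)u - \rho(1-\rho)u^2} + \left(\rho - \tfrac{1}{2}\right)u \right) \end{aligned}$$

which yields the first inequality after integration with respect to $\lambda'$.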
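
These inequalities are easy to check numerically on discrete distributions, taking the counting measure as reference measure. The sketch below uses the constant $C_\rho = \frac{1}{\rho}\min\left(\frac{1-\rho}{\rho}, 1\right)\left(\ln\left(1 + \frac{\rho}{1-\rho}\right) - \rho\right)$ read off from the argument above and verifies $C_\rho\, d^2 \le JKL_\rho \le KL$ on random examples. The function names are hypothetical, and a numerical check is of course not a proof.

```python
import math
import random

def kl(s, t):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(si * math.log(si / ti) for si, ti in zip(s, t) if si > 0)

def jkl(s, t, rho):
    """Jensen-Kullback-Leibler divergence (1/rho) KL(s, (1 - rho) s + rho t)."""
    mix = [(1 - rho) * si + rho * ti for si, ti in zip(s, t)]
    return kl(s, mix) / rho

def hellinger2(s, t):
    """Squared Hellinger distance sum (sqrt(s) - sqrt(t))^2."""
    return sum((math.sqrt(si) - math.sqrt(ti)) ** 2 for si, ti in zip(s, t))

def c_rho(rho):
    return min((1 - rho) / rho, 1.0) * (math.log(1 + rho / (1 - rho)) - rho) / rho

def random_distribution(rng, size):
    w = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(w)
    return [wi / total for wi in w]

rng = random.Random(0)
for _ in range(1000):
    s, t = random_distribution(rng, 5), random_distribution(rng, 5)
    rho = rng.uniform(0.05, 0.95)
    assert c_rho(rho) * hellinger2(s, t) <= jkl(s, t, rho) + 1e-12
    assert jkl(s, t, rho) <= kl(s, t) + 1e-12
print("all checks passed")
```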

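Recall now that $KL(s\,d\lambda, t\,d\lambda) \ge \frac{1}{2}\|s - t\|^2_{\lambda,1}$ so that

$$\begin{aligned} JKL_{\rho}(s\,d\lambda, t\,d\lambda) &= \frac{1}{\rho} KL\left(s\,d\lambda, ((1-\rho)s + \rho t)\,d\lambda\right) \\ &\ge \frac{1}{2\rho} \|\rho(s - t)\|^2_{\lambda,1} \\ &\ge \frac{\rho}{2} \|s - t\|^2_{\lambda,1}. \end{aligned}$$

Combining this result with $d^2_{\lambda}(s,t) \ge \frac{1}{4}\|s - t\|^2_{\lambda,1}$ allows us to conclude.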
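For the third series of inequalities,

$$d^2(s\,d\lambda, t\,d\lambda) = d^2_{t\,d\lambda}\left(\frac{s}{t}, 1\right) = \int \left( \sqrt{\frac{s}{t}} - 1 \right)^2 t\,d\lambda,$$

while

$$KL(s\,d\lambda, t\,d\lambda) = KL_{t\,d\lambda}\left(\frac{s}{t}, 1\right) = \int \frac{s}{t} \ln \frac{s}{t}\, t\,d\lambda = \int \left( \frac{s}{t} \ln \frac{s}{t} - \frac{s}{t} + 1 \right) t\,d\lambda.$$

It turns out that $\forall x \in [0, M]$,

$$(\sqrt{x} - 1)^2 \le x \ln x - x + 1 \le \left(2 + (\ln M)_+\right)(\sqrt{x} - 1)^2$$

which yields the result.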
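
The elementary inequality used here can be checked on a grid. The sketch below evaluates both sides on regular grids of $[0, M]$ for a few values of $M$; the function names, the grid size and the small numerical tolerance are arbitrary choices of this example.

```python
import math

def middle(x):
    """x ln x - x + 1, extended by continuity at x = 0."""
    return 1.0 if x == 0 else x * math.log(x) - x + 1.0

def check_bounds(M, steps=1000):
    """Check both inequalities on a regular grid of [0, M]."""
    upper_const = 2.0 + max(math.log(M), 0.0)
    for i in range(steps + 1):
        x = M * i / steps
        h = (math.sqrt(x) - 1.0) ** 2
        if not (h <= middle(x) + 1e-12 and middle(x) <= upper_const * h + 1e-12):
            return False
    return True

for M in (0.5, 10.0, 100.0):
    print(M, check_bounds(M))
```

Applied with $x = s/t$ and $M = \|s/t\|_\infty$, the two sides give the comparison between the Kullback-Leibler divergence and the squared Hellinger distance.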

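For the last bound, we use an idea of Kolaczyk and Nowak [33]:

$$KL(s\,d\lambda, t\,d\lambda) = \int s \ln\left(\frac{s}{t}\right) d\lambda = \int \left( t - s + s \ln\left(\frac{s}{t}\right) \right) d\lambda$$

and as $\log x \le x - 1$

$$\le \int \left( t - s + s\left(\frac{s}{t} - 1\right) \right) d\lambda$$

and assuming that $t$ does not vanish

$$= \int \frac{(t - s)^2}{t}\,d\lambda \le \left\| \frac{1}{t} \right\|_{\infty} \|t - s\|^2_{\lambda,2}. \qquad \square$$

A.2 Proof of Proposition 2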

For sake of simplicity, we remove from now on the subscript reference to the common measure A 
from all notations. 

Proposition 2 is split into three propositions: Proposition 5 handles the case of bracketing
dimension 0, Proposition 6 applies when one controls the bracketing entropy of the models $S_m$
while Proposition 7 applies using bounds on the bracketing entropy of the local models
$S_m(s_m, \sigma)$. Recall that Assumption (H$_m$) is the existence of a non-decreasing function
$\phi_m$ such that $\delta \mapsto \frac{1}{\delta}\phi_m(\delta)$ is non-increasing and
$\int_0^{\sigma} \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma))}\,d\delta \le \phi_m(\sigma)$.
The complexity term $\mathfrak{D}_m$ is then defined by $n\sigma_m^2$ where $\sigma_m$ is the unique
root of $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\,\sigma$.

For the case of bracketing dimension 0, it suffices to show that the property holds for the
local models as $H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma)) \le H_{[\cdot],d^{\otimes n}}(\delta, S_m)$.

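Proposition 5. Assume for any $\sigma \in (0, \sqrt{2}]$ and any $\delta \in (0, \sigma]$

$$H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma)) \le V_m,$$

then the function

$$\phi_m(\sigma) = \sigma \sqrt{V_m}$$

satisfies the properties required in Assumption (H$_m$).
Furthermore, $\mathfrak{D}_m = V_m$.

Proof. One checks easily that $\phi_m$ is non-decreasing while
$\delta \mapsto \frac{1}{\delta}\phi_m(\delta) = \sqrt{V_m}$ is constant and thus non-increasing.

Using the assumption on the entropy,

$$\int_0^{\sigma} \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma))}\,d\delta = \int_0^{\sigma} \sqrt{H_{[\cdot],d^{\otimes n}}(\delta \wedge \sqrt{2}, S_m(s_m, \sigma \wedge \sqrt{2}))}\,d\delta \le \int_0^{\sigma} \sqrt{V_m}\,d\delta = \phi_m(\sigma).$$

Finally, the unique root of $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\,\sigma$ is
$\sigma_m = \sqrt{V_m / n}$, which implies $\mathfrak{D}_m = n\sigma_m^2 = V_m$. $\square$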

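If one is only able to bound the bracketing entropy of the global model, one has:

Proposition 6. Assume for any $\delta \in (0, \sqrt{2}]$,

$$H_{[\cdot],d^{\otimes n}}(\delta, S_m) \le \mathcal{D}_m \left( \mathcal{C}_m + \ln \frac{1}{\delta} \right).$$

Then the function

$$\phi_m(\sigma) = \sigma \sqrt{\mathcal{D}_m} \left( \sqrt{\mathcal{C}_m} + \sqrt{\pi} + \sqrt{\ln \frac{1}{\sigma \wedge e^{-1/2}}} \right)$$

satisfies the properties required in Assumption (H$_m$).
Furthermore, $\mathfrak{D}_m$ satisfies

$$\mathfrak{D}_m \le \left( 2\left(\sqrt{\mathcal{C}_m} + \sqrt{\pi}\right)^2 + 1 + \left( \ln \frac{n}{e \left(\sqrt{\mathcal{C}_m} + \sqrt{\pi}\right)^2 \mathcal{D}_m} \right)_+ \right) \mathcal{D}_m$$

where $(x)_+ = x$ if $x \ge 0$ and $(x)_+ = 0$ otherwise.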
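Proof. When $\sigma > e^{-1/2}$,

\[ \phi_m(\sigma) = \sigma\sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{1/2}\right), \]

which is non-decreasing and such that $\delta \mapsto \frac{1}{\delta}\phi_m(\delta) = \sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{1/2}\right)$ is constant and thus non-increasing.

When $\sigma \le e^{-1/2}$,

\[ \phi_m(\sigma) = \sigma\sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\sigma}}\right) \]

and thus

\[ \phi_m'(\sigma) = \sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\sigma}} - \frac{1}{2\sqrt{\ln\frac{1}{\sigma}}}\right) = \sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \frac{1}{\sqrt{\ln\frac{1}{\sigma}}}\left(\ln\frac{1}{\sigma} - \frac{1}{2}\right)\right) > 0 \]

as $\ln\frac{1}{\sigma} - 1/2 \ge 0$ when $\sigma \le e^{-1/2}$; $\phi_m$ is thus non-decreasing. The function

\[ \delta \mapsto \frac{1}{\delta}\phi_m(\delta) = \sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\delta}}\right) \]

is strictly decreasing and thus non-increasing.
Now

\[ \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m)}\, d\delta = \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta \wedge \sqrt{2}, S_m)}\, d\delta \le \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta \wedge e^{-1/2}, S_m)}\, d\delta \]
\[ \le \int_0^\sigma \sqrt{\mathcal{D}_m\left(C_m + \ln\frac{1}{\delta \wedge e^{-1/2}}\right)}\, d\delta \le \sqrt{\mathcal{D}_m}\left(\sigma\sqrt{C_m} + \int_0^{\sigma \wedge e^{-1/2}} \sqrt{\ln\frac{1}{\delta}}\, d\delta + \int_{\sigma \wedge e^{-1/2}}^{\sigma} \sqrt{\ln\frac{1}{e^{-1/2}}}\, d\delta\right). \]

We now rely on

Lemma 1. For any $\sigma \in [0, 1]$, $\int_0^\sigma \sqrt{\ln\frac{1}{\delta}}\, d\delta \le \sigma\left(\sqrt{\ln\frac{1}{\sigma}} + \sqrt{\pi}\right)$.

proved in Maugis and Michel [39] to deduce

\[ \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m)}\, d\delta \le \sqrt{\mathcal{D}_m}\left(\sigma\sqrt{C_m} + (\sigma \wedge e^{-1/2})\left(\sqrt{\ln\frac{1}{\sigma \wedge e^{-1/2}}} + \sqrt{\pi}\right) + \left(\sigma - \sigma \wedge e^{-1/2}\right)_+ \sqrt{\ln\frac{1}{\sigma \wedge e^{-1/2}}}\right) \]
\[ \le \sigma\sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\sigma \wedge e^{-1/2}}}\right) = \phi_m(\sigma). \]

By definition, $\sigma_m$ satisfies $\sqrt{n}\,\sigma_m = \sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\sigma_m \wedge e^{-1/2}}}\right)$. This implies

\[ \sigma_m \ge \frac{1}{\sqrt{n}}\left(\sqrt{C_m} + \sqrt{\pi}\right)\sqrt{\mathcal{D}_m}, \]

which implies by inserting this bound in the initial equality

\[ \sigma_m \le \frac{\sqrt{\mathcal{D}_m}}{\sqrt{n}}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\ln\frac{1}{\frac{1}{\sqrt{n}}\left(\sqrt{C_m} + \sqrt{\pi}\right)\sqrt{\mathcal{D}_m} \wedge e^{-1/2}}}\right) = \frac{\sqrt{\mathcal{D}_m}}{\sqrt{n}}\left(\sqrt{C_m} + \sqrt{\pi} + \sqrt{\frac{1}{2}\left(1 + \left(\ln\frac{n}{e\left(\sqrt{C_m} + \sqrt{\pi}\right)^2\mathcal{D}_m}\right)_+\right)}\right). \]

The bound of the proposition is obtained by squaring this inequality, using the inequality $(\sqrt{a} + \sqrt{b})^2 \le 2(a + b)$ and multiplying by $n$. □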
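
The root $\sigma_m$ of Proposition 6 has no closed form, but it is easy to approximate numerically. The following is only a minimal sketch and is not part of the original proof: the function names and the example values of $n$, $\mathcal{D}_m$ and $C_m$ are illustrative assumptions. It solves $\frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\,\sigma$ by bisection and compares $n\sigma_m^2$ with the bound stated in the proposition.

```python
import math


def phi_over_sigma(sigma, dim, c):
    """phi_m(sigma) / sigma for the function phi_m of Proposition 6."""
    log_term = math.log(1.0 / min(sigma, math.exp(-0.5)))
    return math.sqrt(dim) * (math.sqrt(c) + math.sqrt(math.pi) + math.sqrt(log_term))


def complexity(n, dim, c):
    """n * sigma_m^2, with sigma_m the root of phi_m(sigma) / sigma = sqrt(n) * sigma."""
    def gap(sigma):
        # Increasing in sigma: sqrt(n) * sigma grows, phi_m(sigma) / sigma does not.
        return math.sqrt(n) * sigma - phi_over_sigma(sigma, dim, c)

    lo, hi = 1e-12, 1.0
    while gap(hi) < 0:       # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):     # plain bisection
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return n * hi * hi


def complexity_bound(n, dim, c):
    """Right-hand side of the bound stated in Proposition 6."""
    c_star = (math.sqrt(c) + math.sqrt(math.pi)) ** 2
    log_plus = max(0.0, math.log(n / (math.e * c_star * dim)))
    return (2.0 * c_star + 1.0 + log_plus) * dim


if __name__ == "__main__":
    # Hypothetical (n, dimension, C_m) triples, chosen only for illustration.
    for n, dim, c in [(1000, 10, 1.0), (20, 10, 1.0), (10**6, 50, 3.0)]:
        print(n, dim, c, complexity(n, dim, c), complexity_bound(n, dim, c))
```

In such experiments the computed complexity should sit between $(\sqrt{C_m} + \sqrt{\pi})^2\mathcal{D}_m$ and the bound of the proposition, the logarithmic term only mattering when $n$ is large compared with $\mathcal{D}_m$.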

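If one is able to bound the bracketing entropy of the local models, one can use:

Proposition 7. Assume for any $\sigma \in (0, \sqrt{2}]$ and any $\delta \in (0, \sigma]$

\[ H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma)) \le \mathcal{D}_m\left(C_m + \ln\frac{\sigma}{\delta}\right). \]

Then the function

\[ \phi_m(\sigma) = \sigma\sqrt{\mathcal{D}_m}\left(\sqrt{C_m} + \sqrt{\pi}\right) \]

satisfies the properties required in Assumption (H$_m$).
Furthermore, $\mathfrak{D}_m$ satisfies

\[ \mathfrak{D}_m = \left(\sqrt{C_m} + \sqrt{\pi}\right)^2 \mathcal{D}_m. \]

Proof of Proposition 7. By construction, the function $\phi_m$ is non-decreasing while the function $\delta \mapsto \frac{1}{\delta}\phi_m(\delta)$ is constant and thus non-increasing.
Now,

\[ \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma))}\, d\delta \le \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta \wedge \sqrt{2}, S_m(s_m, \sigma \wedge \sqrt{2}))}\, d\delta \]
\[ \le \int_0^\sigma \left(\sqrt{C_m} + \sqrt{\ln\frac{\sigma}{\delta}}\right) d\delta\, \sqrt{\mathcal{D}_m} \le \left(\sigma\sqrt{C_m} + \sigma\int_0^1 \sqrt{\ln\frac{1}{\delta}}\, d\delta\right)\sqrt{\mathcal{D}_m}. \]

We now use Lemma 1 to obtain

\[ \int_0^\sigma \sqrt{H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma))}\, d\delta \le \sigma\left(\sqrt{C_m} + \sqrt{\pi}\right)\sqrt{\mathcal{D}_m} = \phi_m(\sigma). \]

By definition of $\phi_m$,

\[ \frac{1}{\sigma}\phi_m(\sigma) = \sqrt{n}\,\sigma \iff \left(\sqrt{C_m} + \sqrt{\pi}\right)\sqrt{\mathcal{D}_m} = \sqrt{n}\,\sigma \iff \sigma = \frac{\left(\sqrt{C_m} + \sqrt{\pi}\right)\sqrt{\mathcal{D}_m}}{\sqrt{n}}. \]

Squaring this equality and multiplying by $n$ yields the equality of the proposition. □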
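
Lemma 1 is the only analytic ingredient shared by the proofs of Propositions 6 and 7, and it can be checked numerically. The sketch below is an illustration rather than part of the original argument; the function names are assumptions. It uses the change of variable $u = \ln(1/\delta)$, which turns the integral into an upper incomplete gamma function with a closed form in terms of the complementary error function, and compares it with a crude midpoint rule and with the bound of the lemma.

```python
import math


def entropy_integral(sigma):
    """Integral of sqrt(ln(1/delta)) over (0, sigma], for 0 < sigma <= 1.

    With u = ln(1/delta) the integral equals Gamma(3/2, a), a = ln(1/sigma),
    and Gamma(3/2, a) = sqrt(a) * exp(-a) + (sqrt(pi) / 2) * erfc(sqrt(a)).
    """
    a = math.log(1.0 / sigma)
    return sigma * math.sqrt(a) + 0.5 * math.sqrt(math.pi) * math.erfc(math.sqrt(a))


def entropy_integral_midpoint(sigma, cells=20000):
    """Midpoint-rule approximation of the same integral, as an independent check."""
    h = sigma / cells
    return h * sum(math.sqrt(math.log(1.0 / ((i + 0.5) * h))) for i in range(cells))


def lemma1_bound(sigma):
    """Right-hand side of Lemma 1: sigma * (sqrt(ln(1/sigma)) + sqrt(pi))."""
    return sigma * (math.sqrt(math.log(1.0 / sigma)) + math.sqrt(math.pi))


if __name__ == "__main__":
    for sigma in (0.05, 0.3, 0.7, 1.0):
        print(sigma, entropy_integral(sigma), entropy_integral_midpoint(sigma), lemma1_bound(sigma))
```

For $\sigma = 1$ the closed form reduces to $\sqrt{\pi}/2$, so the constant $\sqrt{\pi}$ used in the proof of Proposition 7 is not tight, which is harmless for the argument.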
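A.3 Proof of Theorem 1

Proof of Theorem 1. For any function $g$, which may depend on the observed $(X_i, Y_i)$, we define its empirical process $P_n^{\otimes n}(g)$ by

\[ P_n^{\otimes n}(g) = \frac{1}{n}\sum_{i=1}^n g(X_i, Y_i) \]

and its mean $P^{\otimes n}(g)$ by

\[ P^{\otimes n}(g) = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n g(X_i', Y_i')\right] \]

where $(X_i', Y_i')$ is an independent copy of $(X_i, Y_i)$. Note that when $g$ depends on the $(X_i, Y_i)$, $P^{\otimes n}(g)$ is a random variable. Let $\nu_n^{\otimes n}(g)$ denote the recentred process $P_n^{\otimes n}(g) - P^{\otimes n}(g)$.

Using this definition,

\[ KL^{\otimes n}(s_0, t) = P^{\otimes n}\left(-\ln\frac{t}{s_0}\right) \quad\text{and}\quad JKL_\rho^{\otimes n}(s_0, t) = P^{\otimes n}\left(-\frac{1}{\rho}\ln\frac{(1-\rho)s_0 + \rho t}{s_0}\right). \]

By construction, $\widehat{s}_m$ satisfies

\[ P_n^{\otimes n}(-\ln \widehat{s}_m) \le \inf_{s_m \in S_m} P_n^{\otimes n}(-\ln s_m) + \frac{\eta}{n}. \]

We let $\bar{s}_m$ be a function such that

\[ KL^{\otimes n}(s_0, \bar{s}_m) \le \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\delta_{KL}}{n}. \]

We then define the functions $kl(\bar{s}_m)$, $kl(\widehat{s}_m)$, and $jkl(\widehat{s}_m)$ by

\[ kl(\bar{s}_m) = -\ln\left(\frac{\bar{s}_m}{s_0}\right), \quad kl(\widehat{s}_m) = -\ln\left(\frac{\widehat{s}_m}{s_0}\right), \quad jkl(\widehat{s}_m) = -\frac{1}{\rho}\ln\left(\frac{(1-\rho)s_0 + \rho\widehat{s}_m}{s_0}\right). \]

By construction

\[ P_n^{\otimes n}(kl(\widehat{s}_m)) \le P_n^{\otimes n}(kl(\bar{s}_m)) + \frac{\eta}{n}. \]

Since, by concavity of the logarithm,

\[ jkl(\widehat{s}_m) = -\frac{1}{\rho}\ln\left(\frac{(1-\rho)s_0 + \rho\widehat{s}_m}{s_0}\right) \le -\frac{1}{\rho}\left((1-\rho)\ln\frac{s_0}{s_0} + \rho\ln\frac{\widehat{s}_m}{s_0}\right) = -\ln\frac{\widehat{s}_m}{s_0} = kl(\widehat{s}_m), \]

one has

\[ P_n^{\otimes n}(jkl(\widehat{s}_m)) \le P_n^{\otimes n}(kl(\bar{s}_m)) + \frac{\eta}{n} \]

and thus

\[ P^{\otimes n}(jkl(\widehat{s}_m)) - \nu_n^{\otimes n}(kl(\bar{s}_m)) \le P^{\otimes n}(kl(\bar{s}_m)) - \nu_n^{\otimes n}(jkl(\widehat{s}_m)) + \frac{\eta}{n}. \]

Using the definition of $jkl(\widehat{s}_m)$ and of $kl(\bar{s}_m)$, we deduce

\[ JKL_\rho^{\otimes n}(s_0, \widehat{s}_m) - \nu_n^{\otimes n}(kl(\bar{s}_m)) \le \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) - \nu_n^{\otimes n}(jkl(\widehat{s}_m)) + \frac{\eta}{n} + \frac{\delta_{KL}}{n} \]

where $JKL_\rho^{\otimes n}(s_0, \widehat{s}_m)$ is still a random variable.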

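We now rely on a control of the deviation of $-\nu_n^{\otimes n}(jkl(\widehat{s}_m))$ through its conditional expectation. For any random variable $Z$ and any event $A$ such that $\mathbb{P}\{A\} > 0$, we let $\mathbb{E}^A[Z] = \frac{\mathbb{E}[Z \mathbf{1}_A]}{\mathbb{P}\{A\}}$. It is sufficient to control those quantities for all $A$ to obtain a control of the deviation. More precisely,

Lemma 2. Let $Z$ be a random variable, assume there exists a non-decreasing function $\Psi$ such that for all $A$ such that $\mathbb{P}\{A\} > 0$, $\mathbb{E}^A[Z] \le \Psi\left(\ln\frac{1}{\mathbb{P}\{A\}}\right)$. Then for all $x$, $\mathbb{P}\{Z > \Psi(x)\} \le e^{-x}$.

Here, we can prove

Lemma 3. There exist three absolute constants $\kappa_0' > 4$, $\kappa_1'$ and $\kappa_2'$ such that, under Assumption (H$_m$), for all $m \in \mathcal{M}$, for every $y_m > \sigma_m$ and every event $A$ such that $\mathbb{P}\{A\} > 0$,

\[ \mathbb{E}^A\left[\frac{-\nu_n^{\otimes n}(jkl(\widehat{s}_m))}{y_m^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_m)}\right] \le \frac{\kappa_1'\sigma_m}{y_m} + \frac{\kappa_2'}{\sqrt{n y_m^2}}\sqrt{\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)} + \frac{18}{n y_m^2 \rho}\ln\left(\frac{1}{\mathbb{P}\{A\}}\right). \]

Combining Lemma 2 and Lemma 3 implies that, except on a set of probability less than $e^{-x}$, for any $y_m > \sigma_m$

\[ \frac{-\nu_n^{\otimes n}(jkl(\widehat{s}_m))}{y_m^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_m)} \le \frac{\kappa_1'\sigma_m}{y_m} + \kappa_2'\sqrt{\frac{x}{n y_m^2}} + \frac{18}{\rho}\frac{x}{n y_m^2}. \]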
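Choosing $y_m = \theta\sqrt{\sigma_m^2 + \frac{x}{n}}$ with $\theta > 1$ to be fixed later, we deduce that, except on a set of probability less than $e^{-x}$,

\[ \frac{-\nu_n^{\otimes n}(jkl(\widehat{s}_m))}{y_m^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_m)} \le \frac{\kappa_1' + \kappa_2'}{\theta} + \frac{18}{\theta^2\rho}. \]

Thus, except on the same set,

\[ -\nu_n^{\otimes n}(jkl(\widehat{s}_m)) \le \left(\frac{\kappa_1' + \kappa_2'}{\theta} + \frac{18}{\theta^2\rho}\right)\left(y_m^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_m)\right). \]

Let $\epsilon_{pen} > 0$, we define $\theta_{pen}$ by $\left(\frac{\kappa_1' + \kappa_2'}{\theta_{pen}} + \frac{18}{\theta_{pen}^2\rho}\right)\kappa_0' = C_\rho\epsilon_{pen}$ with $C_\rho$ defined in Proposition 1 and, as $\widehat{s}_m$ is a conditional density, $C_\rho d^{2\otimes n}(s_0, \widehat{s}_m) \le JKL_\rho^{\otimes n}(s_0, \widehat{s}_m)$. Thus, we obtain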



(1 - ep,^)JKLf- {so, Sm) - V®- {kl{sm)) < inf ia®" (sq, s„) + 



< inf ia®"(so,s„) 



Kn n n 



ri rp. 
^ptpent'pen 



+ ^ + 

n n 

Let $\kappa_0 = \frac{C_\rho\epsilon_{pen}\theta_{pen}^2}{\kappa_0'}$, we obtain that, with probability smaller than $e^{-x}$,



JKLf"(so,s-) > — 



inf ia®"'(so,Sm) + /^oc^^ 



n n 



1-e, 



pen 



1 ^pen ^ 



which can be rewritten as, with probability smaller than $e^{-x}$,



n n 



For any non-negative random variable $Z$ and any $a > 0$, $\mathbb{E}[Z] = a\int_{z \ge 0} \mathbb{P}\{Z > az\}\, dz$ so



E 



1 ^pen V^mG'S'i 



iiif KL'^^{so,Sm) + noal, 

1 



+E 



z>0 

n n 



1 1 

- '^0^ 



As by construction $\nu_n^{\otimes n}(kl(\bar{s}_m))$ is integrable and $\mathbb{E}\left[\nu_n^{\otimes n}(kl(\bar{s}_m))\right] = 0$, we derive



E[jKLf-{so,s;^)]< — 



inf (so, s,„) + koct4 ) + 



As $\delta_{KL}$ can be chosen arbitrarily small, this implies

1 



E [JKLf- (so, %)] < . inf ia®" (so, s^) + Koa: 



and thus $C_1 = \frac{1}{1 - \epsilon_{pen}}$ and $C_2 = \frac{\kappa_0}{1 - \epsilon_{pen}}$.



1 - Epen n n n 



KO ^ , V + V' 



1 ^pen ^ 



□ 
B Proofs for Section 3 (Model selection and penalized maximum likelihood)

B.1 Proof of Theorem 2

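Proof of Theorem 2. For any model $S_m$, we let $\bar{s}_m$ be a function such that

\[ KL^{\otimes n}(s_0, \bar{s}_m) \le \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\delta_{KL}}{n}. \]

Let $m \in \mathcal{M}$ such that $KL^{\otimes n}(s_0, \bar{s}_m) < +\infty$ and let

\[ \mathcal{M}' = \left\{ m' \in \mathcal{M} \;\middle|\; P_n^{\otimes n}(-\ln \widehat{s}_{m'}) + \frac{\mathrm{pen}(m')}{n} \le P_n^{\otimes n}(-\ln \widehat{s}_m) + \frac{\mathrm{pen}(m)}{n} + \frac{\eta'}{n} \right\}. \]

For every $m' \in \mathcal{M}'$,

\[ P_n^{\otimes n}(kl(\widehat{s}_{m'})) + \frac{\mathrm{pen}(m')}{n} \le P_n^{\otimes n}(kl(\widehat{s}_m)) + \frac{\mathrm{pen}(m)}{n} + \frac{\eta'}{n} \le P_n^{\otimes n}(kl(\bar{s}_m)) + \frac{\mathrm{pen}(m)}{n} + \frac{\eta + \eta'}{n}. \]

Since, by concavity of the logarithm, $jkl(\widehat{s}_{m'}) \le kl(\widehat{s}_{m'})$,

\[ P_n^{\otimes n}(jkl(\widehat{s}_{m'})) + \frac{\mathrm{pen}(m')}{n} \le P_n^{\otimes n}(kl(\bar{s}_m)) + \frac{\mathrm{pen}(m)}{n} + \frac{\eta + \eta'}{n} \]

and thus

\[ P^{\otimes n}(jkl(\widehat{s}_{m'})) - \nu_n^{\otimes n}(kl(\bar{s}_m)) \le P^{\otimes n}(kl(\bar{s}_m)) + \frac{\mathrm{pen}(m)}{n} - \nu_n^{\otimes n}(jkl(\widehat{s}_{m'})) - \frac{\mathrm{pen}(m')}{n} + \frac{\eta + \eta'}{n}. \]

Using the definition of $jkl(\widehat{s}_{m'})$ and of $kl(\bar{s}_m)$, we deduce

\[ JKL_\rho^{\otimes n}(s_0, \widehat{s}_{m'}) - \nu_n^{\otimes n}(kl(\bar{s}_m)) \le \inf_{s_m \in S_m} KL^{\otimes n}(s_0, s_m) + \frac{\mathrm{pen}(m)}{n} - \nu_n^{\otimes n}(jkl(\widehat{s}_{m'})) - \frac{\mathrm{pen}(m')}{n} + \frac{\eta + \eta'}{n} + \frac{\delta_{KL}}{n}. \]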

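Combining again Lemma 2 and Lemma 3, we deduce that, except on a set of probability less than $e^{-x_{m'} - x}$, for any $y_{m'} > \sigma_{m'}$,

\[ \frac{-\nu_n^{\otimes n}(jkl(\widehat{s}_{m'}))}{y_{m'}^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_{m'})} \le \frac{\kappa_1'\sigma_{m'}}{y_{m'}} + \kappa_2'\sqrt{\frac{x_{m'} + x}{n y_{m'}^2}} + \frac{18}{\rho}\frac{x_{m'} + x}{n y_{m'}^2}. \]

Choosing this time $y_{m'} = \theta\sqrt{\sigma_{m'}^2 + \frac{x_{m'} + x}{n}}$ with $\theta > 1$ to be fixed later, we deduce that, except on a set of probability less than $e^{-x_{m'} - x}$,

\[ \frac{-\nu_n^{\otimes n}(jkl(\widehat{s}_{m'}))}{y_{m'}^2 + \kappa_0' d^{2\otimes n}(s_0, \widehat{s}_{m'})} \le \frac{\kappa_1' + \kappa_2'}{\theta} + \frac{18}{\theta^2\rho}. \]

Using the Kraft condition of Assumption (K), we deduce that if we make this choice of $y_{m'}$ for all models $m'$, these properties hold simultaneously for all $m' \in \mathcal{M}$ except on a set of probability less than $\Sigma e^{-x}$.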
Thus, except on the same set, simultaneously for all $m' \in \mathcal{M}'$,

JKLf- iS0,Sm') - {kl{Sm)) < inf ia®" (so, Sm) + 



+ 



n n 

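Let $\epsilon_{pen} > 0$, we define $\theta_{pen}$ by $\left(\frac{\kappa_1' + \kappa_2'}{\theta_{pen}} + \frac{18}{\theta_{pen}^2\rho}\right)\kappa_0' = C_\rho\epsilon_{pen}$ with $C_\rho$ defined in Proposition 1

and, as $\widehat{s}_{m'}$ is a conditional density, $C_\rho d^{2\otimes n}(s_0, \widehat{s}_{m'}) \le JKL_\rho^{\otimes n}(s_0, \widehat{s}_{m'})$, we obtain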

(1 - epen) Jia®"(so,S„0 - V^-{U{s,r^)) < mf ia®"(so, S„) + 

_^ CpCpenj/^/ _ pen(mO _^ rj + rj' _^ Skl 
Kq n n n ' 

We should now study $\frac{C_\rho\epsilon_{pen}y_{m'}^2}{\kappa_0'} - \frac{\mathrm{pen}(m')}{n}$:

CpCpeny^' pen(m') _ CpCpen^'pen /^^2 _^ + a: ^ pen(m') 



and by construction if we let kq = '^''^"T^"'" 

CpCpeny^/ pen(m') , a; Ko.pen(m') 

-, S «o (.J- J • 

Kg n n K n 

We deduce thus, except on a set of probability smaller than $\Sigma e^{-x}$, simultaneously for any $m' \in \mathcal{M}'$,



(1 - epen)Jia«"(so,S„0 + (1 - ^)P?^ _ 

71/ 

< mf KL^"'{so,Sm) -\ l-Ko-H 1 



smESm n n n n 

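As $\nu_n^{\otimes n}(kl(\bar{s}_m))$ is integrable (and of mean 0), we derive that $M = \sup_{m' \in \mathcal{M}'} x_{m'}$ is almost surely finite, so that as $x_{m'} \le M$ for every $m' \in \mathcal{M}'$, one has

\[ \Sigma \ge \sum_{m' \in \mathcal{M}'} e^{-x_{m'}} \ge |\mathcal{M}'| e^{-M} \]

and thus $\mathcal{M}'$ is almost surely finite. This implies that the minimizer $\widehat{m}$ of $P_n^{\otimes n}(-\ln(\widehat{s}_m)) + \frac{\mathrm{pen}(m)}{n}$ exists.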

For this minimizer, one has with probability greater than $1 - \Sigma e^{-x}$,

(1 - epen) J/a,«''(.o, + (1 - ^)P^ - U^'^ikli-Srr.)) 

rv Tl 

< inf i^L«"(so,.™) + ^^ + «o- + ^ + ^ 
smSSm n n n n 
which yields, by the same integration technique as in the proof of the previous theorem,



E 



JKL^-{s,,s^) + - 



1 — ^ pen(TO) 



< 



1 



inf KL®-{sQ,Sm) + 



pen(TO) 



+ 



As $\delta_{KL}$ can be chosen arbitrarily small, this implies



E 



JKLf-{so,s;^) + - 



1 — ^ pen(TO) 



n 



< 



1 - e 
+ 



kq S _^ r] + r]' _^ Skl 
1 - epen n n n ' 



inf KL®-{so,Sm) 



pen 



pen(m) 



77 + 77' 



which is slightly stronger than the result stated in the theorem with $C_1 = \frac{1}{1 - \epsilon_{pen}}$ and $C_2 = \frac{\kappa_0}{1 - \epsilon_{pen}}$, as the penalty of the selected model appears in the left-hand side with a positive weight. □



B.2 Proof of Lemma 2 

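Proof of Lemma 2. Let $A = \{Z > \Psi(x)\}$. Either $\mathbb{P}\{A\} = 0 \le e^{-x}$ or

\[ \mathbb{E}^A[Z] \le \Psi\left(\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)\right). \]

Now in the latter case,

\[ \mathbb{E}^A[Z] = \frac{\mathbb{E}\left[Z \mathbf{1}_{\{Z > \Psi(x)\}}\right]}{\mathbb{P}\{Z > \Psi(x)\}} > \Psi(x). \]

Hence $\Psi(x) < \Psi\left(\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)\right)$, which implies $x < \ln\left(\frac{1}{\mathbb{P}\{A\}}\right)$ as $\Psi$ is non-decreasing. This last inequality yields $\mathbb{P}\{A\} < e^{-x}$, which concludes the proof. □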
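
To see how Lemma 2 converts conditional expectations into tail bounds, one may look at a toy case that is not taken from the paper: a standard exponential variable $Z$. Among all events of probability $p$, the conditional mean $\mathbb{E}^A[Z]$ is largest for the upper tail event $\{Z > \ln(1/p)\}$, where memorylessness gives $\ln(1/p) + 1$. The assumption of the lemma therefore holds with $\Psi(x) = x + 1$, and the lemma yields $\mathbb{P}\{Z > x + 1\} \le e^{-x}$. The short sketch below (hypothetical function names) checks this against the exact tail.

```python
import math


def tail_conditional_mean(p):
    """E[Z | Z > t] for Z ~ Exp(1), with t chosen so that P{Z > t} = p."""
    # Memorylessness: E[Z | Z > t] = t + 1, and P{Z > t} = exp(-t) = p.
    return math.log(1.0 / p) + 1.0


def psi(x):
    """Non-decreasing function with E^A[Z] <= psi(ln(1 / P{A})) for every event A."""
    return x + 1.0


def exact_tail(z):
    """P{Z > z} for Z ~ Exp(1) and z >= 0."""
    return math.exp(-z)


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 3.0):
        # The bound from Lemma 2 is exp(-x); the exact tail is exp(-(x + 1)).
        print(x, exact_tail(psi(x)), math.exp(-x))
```

In this example the lemma loses a factor $e$ compared with the exact tail, which is the price of working with conditional expectations only.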

B.3 Proof of Lemma 3 

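We should now prove Lemma 3, which contains most of the differences with the proof of Massart [38].
Proof of Lemma 3. In this lemma, we want to control the deviation of

\[ -\nu_n^{\otimes n}(jkl(\widehat{s}_m)) = \nu_n^{\otimes n}\left(\frac{1}{\rho}\ln\left(\frac{(1-\rho)s_0 + \rho\widehat{s}_m}{s_0}\right)\right). \]

Note that for any $\widetilde{s}_m$ to be fixed later, if we let $jkl(\widetilde{s}_m) = -\frac{1}{\rho}\ln\left(\frac{(1-\rho)s_0 + \rho\widetilde{s}_m}{s_0}\right)$, then
$-jkl(\widehat{s}_m) = -jkl(\widetilde{s}_m) + \left(-jkl(\widehat{s}_m) + jkl(\widetilde{s}_m)\right)$ with

\[ -jkl(\widehat{s}_m) + jkl(\widetilde{s}_m) = \frac{1}{\rho}\ln\left(\frac{(1-\rho)s_0 + \rho\widehat{s}_m}{(1-\rho)s_0 + \rho\widetilde{s}_m}\right). \]

To control the behavior of these quantities, we use the following key properties of Jensen-Kullback-Leibler related quantities (a rewriting of Lemma 7.26 of Massart [38])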
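Lemma 4. Let $P$ be a probability measure with density $s_0$ with respect to a measure $\lambda$ and $s$, $t$ be some non-negative and $\lambda$-integrable functions, then one has for every integer $k \ge 2$

\[ P\left(\left|\ln\left(\frac{s_0 + s}{s_0 + t}\right)\right|^k\right) \le \frac{k!}{2}\left(\frac{9\left\|\sqrt{s} - \sqrt{t}\right\|_{\lambda,2}^2}{8}\right) 2^{k-2} \]

where $\|\cdot\|_{\lambda,2}$ is the $L^2(\lambda)$ norm, so that $\|\sqrt{s} - \sqrt{t}\|_{\lambda,2}^2$ is nothing but the extended Hellinger distance.

In this lemma, $P(g)$ stands for $\int g s_0\, d\lambda$, i.e. the expectation with respect to the probability $s_0\, d\lambda$. In our context this implies, conditioning first by $(X_i)_{1 \le i \le n}$, applying the previous inequality for each $(s_0(\cdot|X_i), s(\cdot|X_i), t(\cdot|X_i))$ and then taking the expectation, that

\[ P^{\otimes n}\left(\left|\frac{1}{\rho}\ln\left(\frac{s_0 + \frac{\rho}{1-\rho}s}{s_0 + \frac{\rho}{1-\rho}t}\right)\right|^k\right) \le \frac{k!}{2}\left(\frac{9 d^{2\otimes n}(s, t)}{8\rho(1-\rho)}\right)\left(\frac{2}{\rho}\right)^{k-2}. \]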

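We now use

Theorem 5. Assume $f$ is a function such that

\[ P^{\otimes n}\left(|f|^2\right) \le V, \]
\[ \forall k \ge 3, \quad P^{\otimes n}\left((f)_+^k\right) \le \frac{k!}{2} V b^{k-2}. \]

Then for all $A$ such that $\mathbb{P}\{A\} > 0$

\[ \mathbb{E}^A\left(\nu_n^{\otimes n}(f)\right) \le \frac{\sqrt{2V}}{\sqrt{n}}\sqrt{\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)} + \frac{b}{n}\ln\left(\frac{1}{\mathbb{P}\{A\}}\right). \]

These bounds are sufficient to obtain a Bernstein type control for $jkl(\widetilde{s}_m)$:

\[ \mathbb{E}^A\left[-\nu_n^{\otimes n}(jkl(\widetilde{s}_m))\right] \le \frac{3}{2\sqrt{\rho(1-\rho)}}\frac{\sqrt{d^{2\otimes n}(s_0, \widetilde{s}_m)}}{\sqrt{n}}\sqrt{\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)} + \frac{2}{n\rho}\ln\left(\frac{1}{\mathbb{P}\{A\}}\right). \]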
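
Combined with Lemma 2, Theorem 5 gives the classical Bernstein deviation bound $\mathbb{P}\{\nu_n^{\otimes n}(f) > \sqrt{2Vx/n} + bx/n\} \le e^{-x}$. As a sanity check outside the setting of the paper, the sketch below takes centered Bernoulli variables $f = X - p$, for which $|f| \le 1$ so that the moment conditions hold with $V = p(1-p)$ and $b = 1/3$, and compares the exact binomial tail with $e^{-x}$. The function names and the numerical values are illustrative assumptions.

```python
import math


def binomial_upper_tail(n, p, threshold):
    """Exact P{S > threshold} for S ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n + 1)
        if k > threshold
    )


def bernstein_deviation(n, v, b, x):
    """Deviation level sqrt(2 V x / n) + b x / n obtained from Theorem 5 and Lemma 2."""
    return math.sqrt(2.0 * v * x / n) + b * x / n


def check(n, p, x):
    """Return (exact tail of the centered empirical mean, Bernstein bound exp(-x))."""
    # |X - p| <= 1 gives E[(X - p)_+^k] <= V, and k!/2 >= 3**(k - 2), hence b = 1/3.
    v, b = p * (1.0 - p), 1.0 / 3.0
    level = bernstein_deviation(n, v, b, x)
    tail = binomial_upper_tail(n, p, n * (p + level))
    return tail, math.exp(-x)


if __name__ == "__main__":
    for n, p, x in [(50, 0.3, 0.5), (50, 0.3, 2.0), (200, 0.1, 1.0), (30, 0.5, 3.0)]:
        print(n, p, x, *check(n, p, x))
```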



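To cope with the randomness of $\widehat{s}_m$, we rely on the following much more involved theorem (a rewriting of Theorem 6.8 of Massart [38])

Theorem 6. Let $\mathcal{G}$ be a countable class of real valued and measurable functions. Assume that there exist some positive numbers $V$ and $b$ such that for all $f \in \mathcal{G}$ and all integers $k \ge 2$

\[ P^{\otimes n}\left(|f|^k\right) \le \frac{k!}{2} V b^{k-2}. \]

Assume furthermore that for any positive number $\delta$, there exists a finite set $B(\delta)$ of brackets covering $\mathcal{G}$ such that for any bracket $[g^-, g^+] \in B(\delta)$ and all integers $k \ge 2$

\[ P^{\otimes n}\left(|g^+ - g^-|^k\right) \le \frac{k!}{2}\delta^2 b^{k-2}. \]

Let $e^{H(\delta)}$ denote the minimal cardinality of such a covering. There exists an absolute constant $\kappa$ such that, for any $\epsilon \in (0, 1]$ and any measurable set $A$ with $\mathbb{P}\{A\} > 0$,

\[ \mathbb{E}^A\left[\sup_{f \in \mathcal{G}} \nu_n^{\otimes n}(f)\right] \le E + \frac{(1 + 6\epsilon)\sqrt{2V}}{\sqrt{n}}\sqrt{\ln\left(\frac{1}{\mathbb{P}\{A\}}\right)} + \frac{2b}{n}\ln\left(\frac{1}{\mathbb{P}\{A\}}\right) \]

where

\[ E = \frac{\kappa}{\epsilon}\frac{1}{\sqrt{n}}\int_0^{\epsilon\sqrt{V}} \sqrt{H(\delta) \wedge n}\, d\delta + \frac{2(b + \sqrt{V})}{n} H(\sqrt{V}). \]

Furthermore $\kappa \le 27$.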
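If we consider

\[ \mathcal{G}_m(\widetilde{s}_m, \sigma) = \left\{ -jkl(s_m) + jkl(\widetilde{s}_m) = \frac{1}{\rho}\ln\left(\frac{s_0 + \frac{\rho}{1-\rho}s_m}{s_0 + \frac{\rho}{1-\rho}\widetilde{s}_m}\right) \;\middle|\; s_m \in S_m(\widetilde{s}_m, \sigma) \right\}, \]

then the first assumption of Theorem 6 holds with $V = \left(\frac{3\sigma}{2\sqrt{2\rho(1-\rho)}}\right)^2$ and $b = \frac{2}{\rho}$.
We are thus focusing on

\[ W_m(\widetilde{s}_m, \sigma) = \sup_{f \in \mathcal{G}_m(\widetilde{s}_m, \sigma)} \nu_n^{\otimes n}(f) = \sup_{s_m \in S_m(\widetilde{s}_m, \sigma)} \nu_n^{\otimes n}\left(-jkl(s_m) + jkl(\widetilde{s}_m)\right) = \sup_{s_m \in S_m(\widetilde{s}_m, \sigma)} \nu_n^{\otimes n}\left(-jkl(s_m)\right) + \nu_n^{\otimes n}\left(jkl(\widetilde{s}_m)\right). \]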

Now if is a bracket containing s, then 

P V So + T37,Sm / P V So + T37,Sm / P 



and 



9^ -9~ 



^so+if^J 



In 



So that 



fc! 



In 



as soon as ^rf (t J ^ ^ rpj^jg implies that, for any 5 > 0, one can construct a set of brackets 

2V2p(l-p) 

satisfying the second assumption of Theorem 6 from a set of brackets of rf®" width smaller than 
2\/2p(i p) ^ covering Sm(sm,cr)- That is 



H{5) < 



Theorem 6 cannot be used directly with the set $\mathcal{G}_m(\widetilde{s}_m, \sigma)$ as it is not necessarily countable. However, Assumption (Sep$_m$) implies the existence of a countable family $S_m'$ such that



l_p*m 



~ 1 / So 

G'mism, 0-) = { -jkl{sm) + jkl{s} = - In — 

[ p yso + Y^^.! 
is countable, and thus for which the conclusion of Theorem 6 holds, while $\sup_{f \in \mathcal{G}_m'(\widetilde{s}_m, \sigma)} \nu_n^{\otimes n}(f) = \sup_{f \in \mathcal{G}_m(\widetilde{s}_m, \sigma)} \nu_n^{\otimes n}(f)$ with probability 1. We deduce thus that for every measurable set $A$ with $\mathbb{P}\{A\} > 0$,



, ^ K 1 /■Sv2p(l-p) 

where = ; 



3 



, ^(p + 2^2^13^) ^ 2 V2p(l - p) 3a / 

H ^[■],d'S>n —j===,bm,[Sm,(T) 

3k 1 ('^'^ I Z 

2eV2p(l - p) Vn Jo ^ 

2(2+3^) 

V 2V2p(i-p)' / c ^~ ^^ 

H -n[.],d»n (,cr, Ato^StojCtJJ 



Choosing e = 1 leads to 



]E^[W„(?„,ct)] < £;+ — , '^^^ Wlnf -At I + — In, ™, 



where 



2(- H 3°^ ) 

2^/2p{^- p) Vn Jo ^ " n 

By Assumption (Hm), if we assume e 5m, /q"^ {<^: Sm (sm, cr)) A ndS < 0to(c) ) 

as well as (5 ((5, ^'^(Sm, c)) is non-increasing. This implies 



Inserting these bounds in the previous inequality yields 

" 2 V2p(i - p) \p V2p(i - p) y '^^^ 

3k + ( ^ + 3cr \ d,„(a) I 4n(g) 

2v'2p(i-p) U ^2^(1 - p) y 



< 



As (5 (5 ^(l>m{5) is also non-increasing, so is (5 (5 '^(p„i{5). The definition of can be 
written 

Q^ 0m^) -j gQQg as a > iTto- Indeed under this assumption. 



rewritten as the equation '''"li""^-' = l. The right-hand side of the previous inequality is thus an 



E < i 3k _^ 4 _^ 3cr \ (f)^{<7) 
- \2^2p{l-p) P ^2p{\ -p)) 
and



3k 



3cr \ 4>m{<j) 



21a 



In 



/ 1 



2^2p{l-p) P y/2p{l -p)J 2^p{l-p)^n]l \r{A} 

pn"" \¥{A} 



Using now a < \/2, we let k'{ = [ ^J^":^ +^+ ^ 
2^2^(1-^) ' P ' ^ypil-p) J - [^2^p{l-p) ^



K < 27, Ko = — , SO that Vcr > am, 
^VpC^-p) 



sup z/®" {-jkl{sm) + jkl(S)) 



<<^ + ^,/lnfJ:^U^ln^ 1 



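Thanks to Assumption (Sep$_m$), we can use the peeling lemma (Lemma 4.23 of [38]):

Lemma 5. Let $S$ be a countable set, $\widetilde{s} \in S$ and $a : S \to \mathbb{R}_+$ such that $a(\widetilde{s}) = \inf_{s \in S} a(s)$. Let $Z$ be a random process indexed by $S$ and let

\[ B(\sigma) = \{s \in S \mid a(s) \le \sigma\}, \]

assume that for any positive $\sigma$ the non-negative random variable $\sup_{s \in B(\sigma)}\left(Z(s) - Z(\widetilde{s})\right)$ has finite expectation. Then, for any function $\psi$ on $\mathbb{R}_+$ such that $\psi(x)/x$ is non-increasing on $\mathbb{R}_+$ and

\[ \mathbb{E}\left[\sup_{s \in B(\sigma)}\left(Z(s) - Z(\widetilde{s})\right)\right] \le \psi(\sigma), \quad\text{for any } \sigma \ge \sigma_* \ge 0, \]

one has for any positive number $x \ge \sigma_*$

\[ \mathbb{E}\left[\sup_{s \in S}\frac{Z(s) - Z(\widetilde{s})}{x^2 + a^2(s)}\right] \le 4x^{-2}\psi(x). \]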



With S — Sm, s = Sm G Sm to bc Specified with a{s) = d ®''{sm, s) and Z{s) = —jkl{s). 
Provided ym > fm, one obtains 



E^ 



)+,fc/(? ) 



< 4k' 



„<l>m{ym) , O- /^^ ^ 1 



1^ ^ 



Now using again the monotonicity of (5 (5 ^0m(<5) and the definition of am, ^Vm > fn 

(j/m) , 4>m{o-m) 

< = = 



and therefore 
E^ 



^„ , jkl{sm) + jkl{s 

rn ) 



Vnym Vna 

4:K'{am 4:K[ 



< 



1 



We can now choose Sm such that for every .s„i G S'm 

•(so,Sm) < (l + ed)(i^®"(so,Sm) 
so that



For this choice, one obtains 



sup 



-jkl{sm) + jkl{ 



< 



+ + 



4k'' 



16 
pny?, 



In 



/In 
1 



/ 1 



F{A} 



which imphes 



-jkl{Sm) + jkl{Sm) 



y2„ + 2(2 + ed)d2»"(so,s„) 



^ 4<V^ _^ 44' 



/In 



1 



We turn back to the control of $-\nu_n^{\otimes n}(jkl(\widetilde{s}_m))$. Our Bernstein type control yields

3 ^yd^®n(so,Sm) ' 



or for any i/m > and any k' > 0: 



/In 



1 



In 



1 



E^ 



yl + K'^d^^n(S0,Im) 



< 



< 



1 2 

+ 

3 



In 



1 

F{A} 



In 



1 



2 



/In 



/ 1 



In 



/ 1 



We derive thus 



E^ 



-jkl{sm) + jkl{s^) 



yl, + 2(2 + ed)d^®^ (sq, J y'L + K'^d^^n ^so,Sm) 



\F{A}J pnyl \V{A} 
i^S"{jkl{s„,)) 



In 



18 , 
+ ^In 



Let k'^ such that k'/ = 2(2 + ed)/(l + e^), using d^^"{so,Sm) > d^®- (sq, s„)/(l + e^), 



-jkl{sm) + jkl{ 



^yl + 2{2 + ed)d^^-{so,Sm)J ' + t,'/d^<^'^{so,s^) 
and thus 



-jkljSm) 

2/^ + 2(2 + 6^)^28" (so, s™) 



E^ 



-jkljSm) 
Vli + Ko'^^'^"(So,Sm) 



< 



h 



1 



/In 



/ 1 



18 



In 



/ 1 



\F{A}J nylp \P{A} J ' 



where = 2(2 + ed)/(l + e^), k'^ = Ak'I and = 4k'2' + 3/(4y'p(l - p)^;;). 



□ 
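B.4 Behavior of the constants of Theorem 1 and Theorem 2

We now explain the behavior of the constants $\kappa_0$ and $C_2$ with respect to $C_1$ and $\rho$. As shown in the proof, if we let $\epsilon_{pen} = 1 - \frac{1}{C_1}$ then $C_1 = \frac{1}{1 - \epsilon_{pen}}$ and $C_2 = \frac{\kappa_0}{1 - \epsilon_{pen}} = \kappa_0 C_1$, so that it suffices to study the behavior of $\kappa_0$.

Now $\kappa_0$ is defined as equal to $\frac{C_\rho\epsilon_{pen}\theta_{pen}^2}{\kappa_0'}$ with $\theta_{pen}$ the root of $\left(\frac{\kappa_1' + \kappa_2'}{\theta} + \frac{18}{\theta^2\rho}\right)\kappa_0' = C_\rho\epsilon_{pen}$, where we use the constants appearing in Lemma 3. This implies

\[ \kappa_0 = \frac{C_\rho\epsilon_{pen}\theta_{pen}^2}{\kappa_0'} = \theta_{pen}^2\left(\frac{\kappa_1' + \kappa_2'}{\theta_{pen}} + \frac{18}{\theta_{pen}^2\rho}\right) = \theta_{pen}(\kappa_1' + \kappa_2') + \frac{18}{\rho}. \]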

Solving the implied quadratic equation ^pen('«'i + ^2) + = ^pen '^''^J'"" yields 

_«oK+4)(^rT^5fgtF+i) 



pen - 



p'-pen 



and thus 



'^o = + — 



Now 



K[=AK'i = A{ , +-+ . ^ = , ^ (3/c^^+12 + 16J^ 

\^2V2p(l-p) P ^/p(W)V V P 

and using that for any e > 0, once ea is small enough, 2 > > 2(1 — e) 

42 3 1 3 



' 4VKW)«d " 8vp(W)(1 - e) V ' 8(1 - e) 



so that 

(.; + 4f<^(3.y2 + 54+^ + 16 
Now using 4 < Kq < 4(1 + e) 



^ 4(l + e)(3.^/2 + 54+3^ + 16y^) ( ^1 + ^jgfjk^^ ^ 18 

2p(l - p)Cp 

1 

< 



This implies that Kq scales when p is close to 1 proportionally to 

1 P 
and thus explodes when $\rho$ goes to 1 as well as when $\epsilon_{pen}$ goes to 0.

Note that, as it is almost always the case in density estimation, these constants are rather 
large, mostly because of the crude constant appearing in Theorem 6. Indeed let cfm denote the 
supremum over all models of the collection, the right hand side of the previous bound on kq can 
already be replaced by 

1 

- /0)epen 

X (2(1 + .) (3»V2 + 42 + 6^/2,« + ^ + ' [f^^^. + l) + 18^,(1 - 

which is much smaller than the previous quantity as soon as aj^ is much smaller than \/2, which 
can be ensured in the models of Section 4 provided we limit their maximum dimension well below 
n, for instance to n/ln^(n). 



C Proof for Section 4.1 (Covariate partitioning and conditional density estimation)

Proof of Proposition 3. We start by the UDP case, as we stop as soon as ^ > 2~'^^-' < ^, 
J < " 2 and thus there is at most 1 + " 2 different partitions in the collection, which allows 
to prove the proposition in this case. 

Proofs for the RDP, RSDP and RSP cases are handled simultaneously. Indeed, all these partition collections are recursive partition collections and thus correspond to tree structures. More precisely, any RDP can be represented by a $2^d$-ary tree in which a node has value 0 if it has no child or value 1 otherwise. Similarly, any RSDP (respectively RSP) can be represented by a dyadic tree in which a node has value 0 if it has no child or 1 plus the number of the dimension of the split (respectively 1 plus the number of the dimension and the position of the split). Such a tree can be encoded by an ordered list of the values of its nodes. The total length of the code
is thus given by the product of the number of nodes N{V) by their encoding cost (respectively 

[eII bits, [ '"^l^a'"'' ] bits and [ '"^j+^^^^ l + [^]). As this code is decodable, it satisfies the 
Kraft inequality and thus, using the definition of Bq , 



E 



It turns out that the number of nodes $N(\mathcal{P})$ can be computed from the number of hyperrectangles $\|\mathcal{P}\|$ of the partition, which is also the number of leaves in the tree. Indeed, each inner node has exactly $2^d$ children in the RDP case and only 2 in the RDSP and RSP case, while, in
all cases, every node but the root has a single parent. Let d = dx + dy i^n. the RDP case and 
d = 1 in the RDSP and RSP case then 2'^{N{V) - \\V\\) = N{V) - 1 and thus 



with Cq^^ as defined in the proposition. Plugging this in the Kraft inequality leads to 
■Pes*'-*' -Pes*'-*' 
Let now c > Cq ,



J2 e-'^-^o' 'll^ll < ^ 
and as ||P|| > 1 



< g ^0 f'^o = e e " = Sq e " 

which concludes these three cases. 
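
The coding argument used for the recursive partitions can be made concrete on small trees. The sketch below is only an illustration under simplifying assumptions (a one-bit alphabet as in the RDP case, a maximal depth, and hypothetical function names): it enumerates the pre-order codes of all recursive partitions up to a given depth, then checks that the code is prefix-free, that it satisfies the Kraft inequality, and that the node-counting identity $2^d(N(\mathcal{P}) - \|\mathcal{P}\|) = N(\mathcal{P}) - 1$ holds.

```python
from fractions import Fraction
from itertools import product


def tree_codes(depth, arity):
    """Pre-order codes of all recursive partitions with at most `depth` levels of splits.

    A leaf is written '0'; an inner node is written '1' followed by the codes
    of its `arity` children (arity = 2**d for a recursive dyadic partition).
    """
    if depth == 0:
        return ["0"]
    sub = tree_codes(depth - 1, arity)
    return ["0"] + ["1" + "".join(children) for children in product(sub, repeat=arity)]


def is_prefix_free(codes):
    """True when no code is a strict prefix of another one (unique decodability)."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def kraft_sum(codes):
    """Exact value of the sum of 2**(-length) over all codes."""
    return sum(Fraction(1, 2 ** len(code)) for code in codes)


if __name__ == "__main__":
    codes = tree_codes(depth=3, arity=2)
    print(len(codes), is_prefix_free(codes), kraft_sum(codes))
    for code in codes[:5]:
        nodes, leaves = len(code), code.count("0")
        print(code, nodes, leaves, 2 * (nodes - leaves) == nodes - 1)
```

Each node costs one bit here, so the code length equals the number of nodes; the RSDP and RSP cases only change the per-node cost.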

For the HRP cases, it is sufficient to give the uppermost coordinate of the hyperrectangles ordered in a uniquely decodable way based on the following observation: assume we have a current list of hyperrectangles; the complement of the union of these hyperrectangles is either empty, if the list contains all the hyperrectangles of the partition, or contains a lowermost point that is the lowermost corner of a unique hyperrectangle. Furthermore, this hyperrectangle is completely specified by its uppermost corner coordinates. Starting with an empty list, an HRP partition can thus be entirely specified by the list of uppermost corner coordinates obtained through this scheme.

This leads to a code with x dx [i^] bits for each partition that satisfies a Kraft inequality 

HRP(;r) 

AT ^ \ HRP(;f) 

iNow tor any c > Cq , 

^HRp(;f)ll^ll .^-^ _|.^_^HRP(;v)j^HRP(;t)||p|| _^HRP(;t) p,HRp(;t) , 



It is then only a matter of calculation to check that if c is larger than 1 in the UDP and RDP 
cases and larger than 2 In 2 in the other cases then all these sums can be bounded by 1. 

□ 



D Proof for Section 4.2 (Piecewise polynomial conditional density estimation)

Theorem 3 is obtained by proving that Assumptions (H$_{\mathcal{QP},\mathbf{D}}$) and (Sep$_{\mathcal{QP},\mathbf{D}}$) hold for any model $S_{\mathcal{QP},\mathbf{D}}$ while Assumption (K) holds for any model collection. Theorem 3 is then a consequence of Theorem 2.

One easily verifies that Assumption (Sep$_{\mathcal{QP},\mathbf{D}}$) holds whatever the partition choice. Concerning the first assumption:
Proposition 8. Under the assumptions of Theorem 3, there exists a $D_\star$ such that for any model $S_{\mathcal{QP},\mathbf{D}}$, Assumption (H$_{\mathcal{QP},\mathbf{D}}$) is satisfied with a function $\phi$ such that



V 



with = 2D^ + 2n. 

The proof relies on the combination of Proposition 2 and 
Proposition 9. VS'q-p^D) Vsqp^d € -Sq^^d? 



/I „2 ^\ 

-«■[.], d®n {S, S'qp,d(sq^,d, o-)) < I>QP,D y^i^ ||gp|| + -D* + In - J . 



Remark that we also use the inequality 
/ /l 



■In 



D^ + ^/^\ <ln 



2D* + 27r. 



By using Proposition 3 for both V and Q, we obtain the Kraft type assumption: 

Proposition 10. Under the assumptions of Theorem 3, for any collection S, there exists a 
Ci, > such that for 

\ -Riev J 

Assumption (K) is satisfied with ^ e~^a''.° < 1. 

The complete proof is postponed until after that of Proposition 9.

D.1 Proof of Proposition 9

For the sake of simplicity, we drop from now on the subscript referring to the common measure $\lambda$ from all notations. We rely on a link between the $\|\cdot\|_2$ and $\|\cdot\|_\infty$ structures of the square roots of the models and a relationship between bracketing entropy and metric entropy for $\|\cdot\|_\infty$ norms. Following Massart [38], we define the following tensorial norm on functions $u(y|x)$



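\[ \|u\|_2^{\otimes n} = \sqrt{\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \|u(\cdot|X_i)\|_2^2\right]} \quad\text{and}\quad \|u\|_\infty^{\otimes n} = \sqrt{\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n \|u(\cdot|X_i)\|_\infty^2\right]}. \]

As the reference measure is the Lebesgue measure on $[0,1]^{d_Y}$, $\|u\|_\infty^{\otimes n} \ge \|u\|_2^{\otimes n}$. By definition $d^{\otimes n}(s, t) = \|\sqrt{s} - \sqrt{t}\|_2^{\otimes n}$ and thus for any model $S_m$ and any function $s_m \in S_m$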



{s.{ 



u e 



< 



If $\sqrt{S_m}$ is a subset of a linear space $\overline{\sqrt{S_m}}$ of dimension $\mathcal{D}_m$, as in our model, the local models are contained in translates of balls of this linear space, so that one can replace, without loss of generality, $\sqrt{s_m}$ by $0$ and use



< 



Using now 



> 



il2 ' 



one deduces 



< 



< 



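As for any $u$, $[u - \delta/2, u + \delta/2]$ is a $\delta$-bracket for the $\|\cdot\|_2^{\otimes n}$ norm, any covering of $\left\{u \in \overline{\sqrt{S_m}} \mid \|u\|_2^{\otimes n} \le \sigma\right\}$ by $\|\cdot\|_\infty^{\otimes n}$ balls of radius $\delta/2$ yields a covering by the corresponding brackets. This implies

\[ H_{[\cdot],d^{\otimes n}}(\delta, S_m(s_m, \sigma)) \le H_{\|\cdot\|_\infty^{\otimes n}}\left(\frac{\delta}{2}, \left\{u \in \overline{\sqrt{S_m}} \;\middle|\; \|u\|_2^{\otimes n} \le \sigma\right\}\right) \]

where $H_d(\delta, S)$, the classical entropy, is defined as the logarithm of the minimum number of balls of radius $\delta$ with respect to the norm $d$ covering the set $S$.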

The following proposition, proved in the next section, is similar to a proposition of Massart [38]. It provides a bound for this last entropy term under an assumption on a link between the $\|\cdot\|_2^{\otimes n}$ and $\|\cdot\|_\infty^{\otimes n}$ structures:



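Proposition 11. For any basis $\{\phi_k\}_{1 \le k \le \mathcal{D}_m}$ of $\overline{\sqrt{S_m}}$ such that

\[ \forall \beta \in \mathbb{R}^{\mathcal{D}_m}, \quad \left(\left\|\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k\right\|_2^{\otimes n}\right)^2 \ge \|\beta\|_2^2, \]

let

\[ \bar{r}_m(\{\phi_k\}) = \sup_{\beta \ne 0} \frac{1}{\sqrt{\mathcal{D}_m}}\frac{\left\|\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k\right\|_\infty^{\otimes n}}{\|\beta\|_\infty} \]

and let $\bar{r}_m$ be the infimum over all suitable bases.
Then $\bar{r}_m \ge 1$ and

\[ H_{\|\cdot\|_\infty^{\otimes n}}\left(\frac{\delta}{2}, \left\{u \in \overline{\sqrt{S_m}} \;\middle|\; \|u\|_2^{\otimes n} \le \sigma\right\}\right) \le \mathcal{D}_m\left(C_m + \ln\frac{\sigma}{\delta}\right) \]

with $C_m = \ln(\kappa_\infty\bar{r}_m)$ and $\kappa_\infty \le 2\sqrt{2\pi e}$.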

In our setting, using a basis of Legendre polynomials, we are able to derive from Proposition 11 
Proposition 12. For any model of Section 

1 



rQP,D < n (VDd + lv/2Dd + 1) sup 



SO that^SQV € iSq-p D, 

{6, 5'qp^d('Sq^,d, c^)) < Vq-p^-o (Cq-p^jj + In - j 
with Cgrji = In {K-oaropji) and Koo < 2\/27re. 
One easily verifies that
1 



j 1 if all hyperrectangles have same sizes 

K^^Q" VW\\^/[^\ ~ Ia/S^ otherwise. 



Remark that when ★(A') = UDP(A'), -k{y) = UDP(J^) and Qi is independent of TZi, all the 
hyperrectangles have same sizes and that the corresponds to the arbitrary limitation imposed 
on the minimal size of the segmentations. If we limit this minimal size to ^ instead of - this 
factor becomes n. 
Let 



i?* = In Loo n (VDfc + W^^k + l) 



we have slightly more than Proposition 9 as Vsgp d € 'S'g^,D> 



2 WQ-P 



In + + In I j otherwise 



for the same size case 



D.2 Proofs of Proposition 11 and Proposition 12 



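Proof of Proposition 11. Let $(\phi_k)_{1 \le k \le \mathcal{D}_m}$ be a basis of $\overline{\sqrt{S_m}}$ satisfying

\[ \forall \beta \in \mathbb{R}^{\mathcal{D}_m}, \quad \left(\left\|\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k\right\|_2^{\otimes n}\right)^2 \ge \|\beta\|_2^2. \]

Note that for $\beta$ defined by $\forall 1 \le k \le \mathcal{D}_m$, $\beta_k = 1$,

\[ \left\|\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k\right\|_\infty^{\otimes n} \ge \left\|\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k\right\|_2^{\otimes n} \ge \|\beta\|_2 = \sqrt{\mathcal{D}_m} = \sqrt{\mathcal{D}_m}\|\beta\|_\infty \]

so that $\bar{r}_m(\phi) \ge 1$.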

Let the grid ^to(<5, ct): 



|/3 e M^" 



Vl<fc<D„,/3fce^ — —Zand min W-0'\\oo< — ^ — 



By definition, for any u' G vSW» such that ||u'||2 < cr there is a /?' such that u' = J2k=i l^'k^k 
and ||/3'||2 < c. By construction, there is a /3 e 0to(^, c) such that 

||/3-/3'||oo< 



2\/^f„((^) 



Definition of implies then that 

I'm 'Dr, 



X] Mk - X] 



k=l 



k=l 



<fm{cp)V^\W-l3'\\ 
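The set $\left\{\sum_{k=1}^{\mathcal{D}_m} \beta_k\phi_k \mid \beta \in \mathcal{G}_m(\delta, \sigma)\right\}$ is thus a $\frac{\delta}{2}$-covering of $\left\{u \in \overline{\sqrt{S_m}} \mid \|u\|_2^{\otimes n} \le \sigma\right\}$ for the $\|\cdot\|_\infty^{\otimes n}$ norm. It remains thus only to bound the cardinality of $\mathcal{G}_m(\delta, \sigma)$.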



Let Gmi^j cr) be the union of all hypercubes of width 



/xv:f„(0) 



centered on the grid Grn{S^ cr), 



by construction, for any /3 €E Gm{S,(y) there is a /?' with ||/3'||2 < cr such that — /3||oo < 



■ As ||/3'-^||2< V2?r, 



Vol (Gm{S,<T)) = \Gm{S,a) 



loo, this implies \\(3\\2 < cr 



- We then deduce 



< Vol N /3 e 



ll/3||2<a + 



?'m('/>) 



rm{4>) . 

Vol({/3eK^-|||/3||2<l}) 



and thus 



\Gm{S,a)\ < 1 + 



-Dr, 



and as > 1 and Vol ({^ e M^^lll/^lh < l}) < (l^) 



P^'"/2Vol({^eR^'"|||/3||2<l}) 

■Dm/2 



\Gm{5,<j)\ < 



which concludes the proof. 



2\/27rer,„((/))(T 



□ 



Instead of Proposition 12, by mimicking a proof of Massart [38], we prove an extended version 
of it in which the degree of the conditional densities may depend on the hyperrectangle. More 
precisely, we reuse the partition V e S^"^^ and the partitions Qi e S^^^ for TZi gV and define 
now the model Sqp^y) ^ the set of conditional densities such that 

8{y\x)= 4r,(y)l{(=.,y)e7^,xj 

where is a polynomial of degree at most D(7?.;^^) = i (7?.;^ j.), ... ,0^(^(7?. j^^,)^ which 

depends on the leaf. 
By construction. 



dim(52-D)= E E \{{^^iKk) + ^) -1 



The corresponding linear space ^/Sqp^ is 



deg (p^.^)<Ti{nl,)\. 



45 



Instead of the true dimension, we use a slight upper bound 

2^a- D = E E n + 1) = E n + 1) 

Note that the space Sq-p^j^ introduced in the main part of the paper corresponds to the case 
where the degree ^{TZ^^) does not depend on the hyperrectangle TZ^i.- 

Proposition 13. There exists 

SUp^x^ggP ridll (Ei3^<Da(TCf,) V2£>d + l) 1 

rQ-p,T> < / sup = 

such that ysQV ^ G Sqt> u, 

-ff[.],d®n {S, Sq-p ^j:){sq-p ^r>, a)) < Vq-p^j^ (CQ-p^ri + hi-^ 

with Cqv^z) = In («^cx)^q'',d) ('■iT'd i^oo < 2\/27re. 

Proposition 12 is deduced from this proposition with the help of the simple upper bound 



As 



inf^x,eQ-ntiVD,(^-,) + l 



< [| max \/2(Dd + 1) 



once a maximal degree is chosen along each axis, the equivalent of constant of 3 depends 
only on these maximal degrees. Assumption (H$_{\mathcal{QP},\mathbf{D}}$) then holds, with the same constants, simultaneously for all models of both the global choice and local choice strategies. Obtaining the Kraft type assumption, Assumption (K), is only a matter of taking into account the augmentation of
the number of models within the collection. Replacing respectively A*^''^^ by A^^"^^ +ln I'D'^l for 
global optimization and B^'"'^^ by B^^'^^ +ln|X'^| for local optimization, where jP^I denotes 
the size of the family of possible degrees, turns out to be sufficient as mentioned earlier. 

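Proof of Proposition 13. Let $L_D$ be the one-dimensional Legendre polynomial of degree $D$ on $[0, 1]$ and $G_D = \sqrt{2D + 1}\,L_D$ its rescaled version; we recall that, by definition,

\[ \forall D \in \mathbb{N}, \quad \|G_D\|_\infty = \sqrt{2D + 1} \quad\text{and}\quad \forall (D, D') \in \mathbb{N}^2, \quad \int G_D(t) G_{D'}(t)\, dt = \delta_{D,D'}. \]

Let $\mathbf{D} \in \mathbb{N}^{d_Y}$, we define $G_{\mathbf{D}}$ as the polynomial

\[ G_{D_1,\ldots,D_{d_Y}}(y) = G_{D_1}(y_1) \times \cdots \times G_{D_{d_Y}}(y_{d_Y}), \]

by construction

\[ \forall \mathbf{D} \in \mathbb{N}^{d_Y}, \quad \|G_{\mathbf{D}}\|_\infty = \prod_{1 \le d \le d_Y} \sqrt{2D_d + 1} \]

and

\[ \forall (\mathbf{D}, \mathbf{D}') \in \left(\mathbb{N}^{d_Y}\right)^2, \quad \int_{[0,1]^{d_Y}} G_{\mathbf{D}}(y) G_{\mathbf{D}'}(y)\, dy = \delta_{\mathbf{D},\mathbf{D}'}. \]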
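
The two properties of the rescaled Legendre polynomials recalled here can be verified exactly in one dimension. The following sketch is an illustration only, with hypothetical function names: it builds the Legendre polynomials on $[0, 1]$ by the Bonnet recurrence using rational arithmetic, then checks that $\int_0^1 L_D L_{D'} = \delta_{D,D'}/(2D+1)$, which is equivalent to the orthonormality of $G_D = \sqrt{2D+1}\,L_D$, and that $|L_D| \le 1$ with $L_D(1) = 1$, which gives $\|G_D\|_\infty = \sqrt{2D+1}$.

```python
from fractions import Fraction


def shifted_legendre(max_degree):
    """Coefficient lists (lowest degree first) of L_0, ..., L_max_degree on [0, 1]."""
    # Bonnet recurrence in x = 2t - 1: (k + 1) P_{k+1} = (2k + 1) x P_k - k P_{k-1}.
    polys = [[Fraction(1)], [Fraction(-1), Fraction(2)]]
    for k in range(1, max_degree):
        prev, cur = polys[k - 1], polys[k]
        nxt = [Fraction(0)] * (k + 2)
        for i, c in enumerate(cur):          # (2k + 1) * (2t - 1) * P_k
            nxt[i] -= (2 * k + 1) * c
            nxt[i + 1] += 2 * (2 * k + 1) * c
        for i, c in enumerate(prev):         # - k * P_{k-1}
            nxt[i] -= k * c
        polys.append([c / (k + 1) for c in nxt])
    return polys[: max_degree + 1]


def inner(p, q):
    """Exact integral over [0, 1] of the product of two polynomials."""
    return sum(a * b / (i + j + 1) for i, a in enumerate(p) for j, b in enumerate(q))


def evaluate(p, t):
    """Value of the polynomial with coefficient list p at the point t."""
    return sum(c * t**i for i, c in enumerate(p))


if __name__ == "__main__":
    polys = shifted_legendre(4)
    for d, p in enumerate(polys):
        print(d, [str(c) for c in p], inner(p, p), evaluate(p, Fraction(1)))
```

The multivariate statements follow by taking products over the coordinates, since both the integral and the supremum factorize for tensor-product polynomials.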



Now for any hyp errect angle 7^;^^, we define Gjy'-" {x, y) = -j^=Gd{T'^'''' j^^^^x j (a;, 

where T is the affine transform that maps T^'i^ into [0, l]"^ so that 



— n ^21?^ + 1 



fel l<cf<dr 



and 



/a:e[0,l]<*x >/j/e[0,l]<'i' 

Using the piecewise structure, one deduces 



E 



-T^ X -T^ X 



E 



= E 



E l{XieKi} 
\Tli\ ^ 



D<r>(ni,nf^) 



dydx 



E 



4XiG7?,,} 



E E 

-R-l.eQi D<T,(nf.) 



^ |7e,| ^ ^ 



The space s/Sqp^ is spanned by 



D 



but also by the rescaled (j)'^'-" = ^^^^^^^ Gp'" where = ^ Er=i ^^^^f^- For these 



functions, one has 



E E ^n'' 



= E 



E 

i=l 



X X 



E E Pn'^'Pn^i^i^-) 
E E



^ 1, 



= E E 

For II • I loo type norm, 





1 











E E 



E 



n 



E 



Ee 



= -yE 



i=l 



-T^ X -T^ X 



< 



n 



EE 



E ^{-^ieTij sup sup ^ 



< 



n 



E^ 



\ 



< 



E ll^^lloo 



E ii^^i 

r.<D(7^,>;,) 



2 










2 








1 










CSO 



Now 



E li^^i 



dY 



/ 



W V-^Dd + 1 < sup n E v/2i?d + 1 



= 1 \ Dd<Dd(7if J 



while 
V 
This implies



V 



< 



< 



53 |7^;| sup 



, / ^ I'-M sup ^ 

Titer 

suPk,>$ ^,eQ^ ridli ED,<Da(7^,'; „) V2£'d + 1 



< 



sup 



V 



Proposition 13 is then obtained by a simple application of Proposition 11. □

D.3 Proof of Proposition 10 

Proof. By construction 

e-^e^>D = E E E e-^'e-p.-o 



E E E ^' 



Pes; 



By Proposition 3, one can find > max(l, cl^'^\ c^'"'^'') such that 



J2 e-c*(A;'^>+B*'^'||Qdl) <i 



and 



■Pes;'-*' 



Plugging these bounds in the previous equality yields 

J2 e-^e^.-< Yl e-^*K"'+^«*"'"^lO <1. 



■Pes*'-*' 
Proposition holds with the modified weights for polynomial as

^ g-e.l„|13"| ^ |2,M|l-c, 

as soon as > 1. □ 

E Proofs for Section 4.3 (Spatial Gaussian mixtures, models, bracketing entropy and penalties)

As in the piecewise polynomial density case, Theorem 4 is obtained by showing that Assumptions (H$_{\mathcal{P},K,\mathcal{G}}$), (Sep$_{\mathcal{P},K,\mathcal{G}}$) and (K) hold for any collection.

Again, one easily verifies that Assumption (Sep$_{\mathcal{P},K,\mathcal{G}}$) holds. For the complexity assumption, combining Proposition 2 with a bound on the bracketing entropy of the models of type

\[ H_{[\cdot],d^{\otimes n}}(\delta, S_{\mathcal{P},K,\mathcal{G}}) \le \dim(S_{\mathcal{P},K,\mathcal{G}})\left(C + \ln\frac{1}{\delta}\right) \]

one obtains

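Proposition 14. There exists a constant $C$ depending only on $a$, $L_-$, $L_+$, $\lambda_-$ and $\lambda_+$ such that 
for any model $S_{\mathcal{P},K,\mathcal{G}}$ of Theorem 4 Assumption $(H_{\mathcal{P},K,\mathcal{G}})$ is satisfied with a function $\phi$ such that 

$$\mathfrak{D}_{\mathcal{P},K,\mathcal{G}} \le \left(2\left(\sqrt{C}+\sqrt{\pi}\right)^2 + 1 + \left(\ln\frac{n}{e\left(\sqrt{C}+\sqrt{\pi}\right)^2 \dim(S_{\mathcal{P},K,\mathcal{G}})}\right)_+\right)\dim(S_{\mathcal{P},K,\mathcal{G}}).$$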



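For the Kraft assumption, one can verify that 
Proposition 15. For any collection $\mathcal{S}$ of Theorem 4, there is a $c_\star$ such that for the choice 

$$x_{\mathcal{P},K,\mathcal{G}} = c_\star\left(A_0^{\star(\mathcal{P})} + B_0^{\star(\mathcal{P})}\|\mathcal{P}\| + (K-1) + \theta_E\right),$$
Assumption (K) holds with $\sum e^{-x_{\mathcal{P},K,\mathcal{G}}} \le 1$. 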

As in the piecewise polynomial case, the main difficulty lies in controlling the bracketing 
entropy of the models. A proof of Proposition 15 can be found in our technical report [15]. 

We thus focus on the proof of Proposition 14. Due to the complex structure of spatial 
mixtures, we did not manage to bound the bracketing entropy of the local models. We derive only 
an upper bound of the bracketing entropy $H_{[\cdot],d^{\otimes n}}(\delta, S_{\mathcal{P},K,\mathcal{G}})$, but one that is independent of 
the distribution law of $(X_i)_{1 \le i \le n}$: the bracketing entropy with a sup norm Hellinger distance 
$d^{\sup} = \sqrt{d^{2\sup}}$, $H_{[\cdot],d^{\sup}}(\delta, S_{\mathcal{P},K,\mathcal{G}})$, where $d^{2\sup}$ is defined by 

$$d^{2\sup}(s,t) = \sup_x d^2\left(s(\cdot|x), t(\cdot|x)\right).$$

Obviously $d^{2\sup} \ge d^{2\otimes n}$ and thus $H_{[\cdot],d^{\sup}}(\delta, S_{\mathcal{P},K,\mathcal{G}}) \ge H_{[\cdot],d^{\otimes n}}(\delta, S_{\mathcal{P},K,\mathcal{G}})$. This upper bound is 
furthermore design independent. 

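Proposition 14 is a direct consequence of Proposition 2 and 

Proposition 16. There exists a constant $C$ depending only on $a$, $L_-$, $L_+$, $\lambda_-$ and $\lambda_+$ such that 
for any model $S_{\mathcal{P},K,\mathcal{G}}$ of Theorem 4: 

$$H_{[\cdot],d^{\sup}}(\delta, S_{\mathcal{P},K,\mathcal{G}}) \le \dim(S_{\mathcal{P},K,\mathcal{G}}) \left(C + \ln\frac{1}{\delta}\right).$$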
E.1 Model coding 

Proof of Proposition 15. This proposition is a simple combination of Theorem 3, crude bounds 
on the number of different models indexed by $[\mu_\star\, L_\star\, D_\star\, A_\star]^K$ and $[\mu_\star\, L_\star\, D_\star\, A_\star]$ and of classical 
Kraft type inequalities for order selection and variable selection (see for instance the book of 
Massart [38]): 

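Lemma 6. • For the selection of model order $K$, let $x_K = (K-1)$, for $c > 0$, 

$$\sum_{K \ge 1} e^{-c x_K} = \frac{1}{1 - e^{-c}}.$$

• For the ordered variable selection case, $E = \mathrm{span}\{e_i\}_{i \in I}$ with $I = \{1, \ldots, p_E\}$, let $\theta_E = p_E$, 
for $c > 0$, 

$$\sum_{E} e^{-c\theta_E} = \frac{1}{e^{c} - 1}.$$
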

• For the non ordered variable selection case, E = span{ei}ig/ with I C let 

eE=(} + e + \a^)pE,forc>i, 



^ l_e-e • 
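The first two Kraft type sums of Lemma 6 are geometric series, so they can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the infinite sums are truncated, and it assumes the readings $x_K = K-1$ and $\theta_E = p_E$ with $c > 0$.

```python
import math

def order_selection_sum(c, k_max=10_000):
    """Truncated sum over K >= 1 of exp(-c * (K - 1))."""
    return sum(math.exp(-c * (k - 1)) for k in range(1, k_max + 1))

def ordered_selection_sum(c, p_max=10_000):
    """Truncated sum over p_E >= 1 of exp(-c * p_E)."""
    return sum(math.exp(-c * p) for p in range(1, p_max + 1))

def order_selection_closed_form(c):
    """Closed form 1 / (1 - exp(-c)) of the model order sum."""
    return 1.0 / (1.0 - math.exp(-c))

def ordered_selection_closed_form(c):
    """Closed form 1 / (exp(c) - 1) of the ordered variable selection sum."""
    return 1.0 / (math.exp(c) - 1.0)
```

For any fixed $c > 0$ the truncation error is negligible, which is why a plain finite sum is enough here.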



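Using that there are at most $3 \times 3 \times 3 \times 3$ different types of models $[\mu_\star\, L_\star\, D_\star\, A_\star]^K$ and $2 \times 2 \times 2 \times 2$ 
different types of models $[\mu_\star\, L_\star\, D_\star\, A_\star]$, and $3^4 \times 2^4 = 1296$, we obtain 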

SK,T,ges KeN'Ves^ e [/i^* l, d, a*]^ l* d* a,] 



E 



^e--^ sup ^ ^ 



[/i* L« D. A,]^f L, D, A. 



< 1296- — ^ SSe-'=*^o I 

1 - e-"* " 



1 if £^ is known, 

if E is chosen amongst 
^^j^j- spaces spanned by the first 

coordinates, 
2e-(c*-i)(i+in2) if is free. 



Choosing c* slightly larger than max(l,Co) yields the result. □ 
E.2 Entropy of spatial mixtures 

Proof of Proposition 16. While we use the classical Hellinger distance to measure the complexity of 
the simplex $S_{K-1}$ and the set $\mathcal{G}_{E^\perp}$, we use a sup norm Hellinger distance on $\mathcal{G}_{E,K}$ defined by 

$$d^{2\max}\left((s_1, \ldots, s_K), (t_1, \ldots, t_K)\right) = \sup_k d^2(s_k, t_k).$$

We say that $[(s_1, \ldots, s_K), (t_1, \ldots, t_K)]$ is a bracket of $\mathcal{G}_{E,K}$ if $\forall 1 \le k \le K$, $s_k \le t_k$. 

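Using a proof similar to that of Genovese and Wasserman [24], we decompose the entropy into three 
parts with: 

Lemma 7. For any $\delta \in (0, \sqrt{2}]$, 

$$H_{[\cdot],d^{\sup}}(\delta, S_{K,\mathcal{P},\mathcal{G}}) \le \|\mathcal{P}\|\, H_{[\cdot],d}(\delta/3, S_{K-1}) + H_{[\cdot],d^{\max}}(\delta/9, \mathcal{G}_{E,K}) + H_{[\cdot],d}(\delta/9, \mathcal{G}_{E^\perp}).$$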

We bound those bracketing entropies with the help of two results. We first use a lemma 
proved in Genovese and Wasserman [24] that implies the existence of a universal constant $C_S$ 
such that 

$$H_{[\cdot],d}(\delta/3, S_{K-1}) \le (K-1)\left(C_S + \ln\frac{1}{\delta}\right):$$
Lemma 8. For any $\delta \in (0, \sqrt{2}]$, 

$$H_{[\cdot],d}(\delta/3, S_{K-1}) \le (K-1)\left(C_{S_{K-1}} + \ln\frac{1}{\delta}\right)$$

with $C_{S_{K-1}} = \frac{1}{K-1}\ln K + \frac{K}{2(K-1)}\ln(2\pi e) + \ln 3\sqrt{2}$. 



Furthermore, uniformly on K: Cs^-i ^ lii2 + - ln(27re) + ln3\/2 = Cs 

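We then rely on Proposition 4 to handle the bracketing entropy of the Gaussian $K$-tuples 
collection. It implies the existence of two constants $C_{[\star]^K}$ and $C_{[\star]}$ depending only on $a$, $L_-$, $L_+$, $\lambda_-$ 
and $\lambda_+$ such that 

$$H_{[\cdot],d^{\max}}(\delta/9, \mathcal{G}_{E,K}) \le \dim(\mathcal{G}_{E,K})\left(C_{[\star]^K} + \ln\frac{1}{\delta}\right) \quad\text{and}\quad H_{[\cdot],d}(\delta/9, \mathcal{G}_{E^\perp}) \le \dim(\mathcal{G}_{E^\perp})\left(C_{[\star]} + \ln\frac{1}{\delta}\right).$$

As $\dim(S_{K,\mathcal{P},\mathcal{G}}) = \|\mathcal{P}\|(K-1) + \dim(\mathcal{G}_{E,K}) + \dim(\mathcal{G}_{E^\perp})$, we obtain Proposition 16 with $C = 
\max(C_S, C_{[\star]^K}, C_{[\star]})$. □ 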

E.3 Entropy of Gaussian families 

Instead of Proposition 4, we prove the slightly stronger 
Proposition 17. Let ac > f and 



Then for any 6 G (0, a/2], 
where 



and 



V, 



= = CDo = CA„ = 
= = CDk = CAk = K 

Cl = CD = Ca = 1 



I>L = 1 

-n _ pe(pe-i) 

■^'D.pe — 2 



V, 



: In 1 + 



and < 



where cs is a universal constant. 
Furthermore, for any pe < P 



with 



Vl,p^ =ln(n-f/3«ln(^)pB) 

Vd,p. = '-^^ (^^^ + (in {i26P.^pe))) 

Va,p, = {PE - 1) In (2 + In (^) pe) 



V < c v 

Vl,pe < Cl,pI'l,pe 
Vd,pe ^ Cd,p2?d,pe 



and, uniformly over K, 



Cd,p 

^A,p 



In 1 + 



18/3k ap 



/ 39 



2 In CS+ (in ( 126/3„^p 



, , „ 255 ^ A+ , 

ln( 2+— /3«^ln 



Vr„ T, n A 1^ < max ( Cn „ 



C^.^K' + CLi + CDi + CAi (K' " 1) 



L,p 



V^i^' + cl; + cd; ^^^(^^ + ca; (/^' - 1) 



JC'(JC'-l) 

— 2 — 



c^,K' + Cl: + cd: "^" -'^ + ca; - 1) 



+ C 



A,p" 



K'(K'-l) 
2 

CA;(i^'-l) 

c^; K' + cl; + cd; ^^^i^ + ca; (if' - 1) 
< max(C^,p,CL,p,C£),p,CA,p)2?[;a,,L,,D,,A,]^^ 
where the max is taken over every Gaussian set type and every number of classes considered. 
Proposition 4 is obtained by setting $\kappa = 1$ and using the crude bounds $1/9 \le \gamma_\kappa \le 1/4$, 

Proof of Proposition 17. We consider all the Gaussian models at once by a "tensorial" construction 
of a suitable $\delta/9$ bracket collection. 

We first define a set of grids for the mean $\mu$, the volume $L$, the eigenvector matrix $D$ and the 
renormalized eigenvalue matrix $A$ from which one constructs the bracket collection. 



• For any $\delta_\mu$, the grid $\mathcal{G}_\mu(a, p_E, \delta_\mu)$ of $[-a, a]^{p_E}$: 



0. 



iY 



• For any $\delta_L$, the grid $\mathcal{G}_L(L_-, L_+, \delta_L)$ of $[L_-, L_+]$: 

gi.{L_,L+, 5l) = + SlYIq e N, + 5l)^ < i+} . 

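• For any $\delta_D$, the grid $\mathcal{G}_D(p_E, \delta_D)$ of $SO(p_E)$ made of the elements of a $\delta_D$-net with respect 
to the $\|\cdot\|_2$ operator norm (as described by Szarek [45]). 

• For any $\delta_A$, the grid $\mathcal{G}_A(\lambda_-, \lambda_+, p_E, \delta_A)$ of $\mathcal{A}(\lambda_-, \lambda_+(1+\delta_A), p_E)$: 
$$\mathcal{G}_A(\lambda_-, \lambda_+, p_E, \delta_A) = \left\{A \in \mathcal{A}(\lambda_-, \lambda_+(1+\delta_A), p_E) \,\middle|\, \forall 1 \le i < p_E, \exists g_i \in \mathbb{N}, A_{i,i} = \lambda_-(1+\delta_A)^{g_i}\right\}.$$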

Obviously, for any n G [—a, a], there is a /i e Qij,{a,pE, 5^) such that 

IIA-mII^ <Pe51 

while 

\g^{a,pE,5f,)\< (^1 + 2^^ <max(^2P-,(^Q ^ 

In the same fashion, for any L in there is a L e 5L(-^-,-t'+,<5L) such that {l+5h)~^Lj^ < 

L < Lj^ while 

\0.iL_,L„5.)\<l+^^^^. 
If we further assume that < 3^ then ln(l + Jl) > {f^L and 



13 In 

\gi^{L_,L+,5T^)\ < 1 + 



12(5l 

By definition of a $\delta_D$-net, for any $D \in SO(p_E)$ there is a $\tilde D \in \mathcal{G}_D(p_E, \delta_D)$ such that 

$$\forall x, \quad \|(\tilde D - D)x\|_2 \le \delta_D \|x\|_2.$$
As proved by Szarek [45], there exists a universal constant $c_S$ such that, as soon as $\delta_D \le 1$, 

$$|\mathcal{G}_D(p_E, \delta_D)| \le \left(\frac{c_S}{\delta_D}\right)^{\frac{p_E(p_E-1)}{2}}$$

where $\frac{p_E(p_E-1)}{2}$ is the intrinsic dimension of $SO(p_E)$. 

The structure of the grid $\mathcal{G}_A(\lambda_-, \lambda_+, p_E, \delta_A)$ is more complex. However, looking at the condition 
on the $p_E - 1$ first diagonal values, 



In 

\gAiX-,x+,PE,SA)\ < 1 2 + 



> \ \ Pe-1 



ln(l + Sa) 

where pe — ^ is the intrinsic dimension of A{X-, X+,Pe)- If we further assume that Sa < then 
ln(l + Sa) >U^a and thus 

85Inf^)^'^"' 

\gA{X-,x+,PE,SA)\< 1 2+ g^^^-^ 

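A key to the success of this construction is the following approximation property of this grid, 
proved later: 

Lemma 9. For $A \in \mathcal{A}(\lambda_-, \lambda_+, p_E)$ there is $\tilde A \in \mathcal{G}_A(\lambda_-, \lambda_+, p_E, \delta_A)$ such that 

$$\left|\tilde A_{i,i}^{-1} - A_{i,i}^{-1}\right| \le \delta_A \lambda_-^{-1}.$$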

Define $c_{\mu_0} = c_{L_0} = c_{D_0} = c_{A_0} = 0$, $c_{\mu_K} = c_{L_K} = c_{D_K} = c_{A_K} = K$, $c_\mu = c_L = c_D = c_A = 1$. 
Let fK,ii^,,pE be the application from (M^'^)'"* to defined by 

H> (/io,i, ■ • ■ ,Mo,k) ifM* = Mo 

(/Xl, . . . H> (/Xi, . . . ,/Xif) if/U* = /Xif, 

jj, ^ ill, . . . , H) ifM* = M 



and /i<-,L* (respectively fK,T>t,pE and fK,K„,PE) be the similar application from (M+)^'"* into 
(K+)^ (respectively from (S'0(33£;))'=°* into (^(^(pe))^ and from (^(0, +oo, peW'^* into (^(0, +o< 
By definition, the image of 

([-a,ar)^- X {[L_,L+]r-* x (SO(pe))^°* x (^(A_, A+,pe))'=** 

by (/if,/i»,pE ^ Ilk. ^pe •8' /if,D,,pE <8i/if,A,) is, up to reordering, the set of parameters of all 
ii'-tuples of Gaussian densities of type [/i*L*,D*, A,t]^. 

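We construct our $\delta/9$ bracket covering with a grid on those parameters. For any $K$-tuple of 
Gaussian parameters $((\mu_1, \Sigma_1), \ldots, (\mu_K, \Sigma_K))$ and any $\delta_\Sigma$, we associate the $K$-tuple of pairs 

$$\Big( \big((1+\kappa\delta_\Sigma)^{-p_E}\Phi_{\mu_1,(1+\delta_\Sigma)^{-1}\Sigma_1},\ (1+\kappa\delta_\Sigma)^{p_E}\Phi_{\mu_1,(1+\delta_\Sigma)\Sigma_1}\big), \ldots, \big((1+\kappa\delta_\Sigma)^{-p_E}\Phi_{\mu_K,(1+\delta_\Sigma)^{-1}\Sigma_K},\ (1+\kappa\delta_\Sigma)^{p_E}\Phi_{\mu_K,(1+\delta_\Sigma)\Sigma_K}\big) \Big).$$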

We prove that, for 7^ and defined in Proposition 17 and any k > |, the choice 

pe' 18/3«pB-12' ° ^ 126/3« A+pB - 84' 9/3«p£-8 
Lemma 10. Let k > |, 7k 



and 



is such that the image of 

by .fK,tj.^,,pE ® .fLK,.,pE ® .fK.D,.pE 55 /i<:,A» is a set of parameters corresponding to a set of pairs 
that is a 5/9-bracket covering of l, d, A*]^ for the d'^^^ norm. 
Indeed, as proved later, 

3(K-i) (k-I) 

2(1 + f )(1 + i)(l + ^) ' 2(1 + f )(1 + 1). 
^ cosh(|) + |. For any < S < \/2, any pe>^ and any < 9;g~^j 

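Let $(\tilde\mu, \tilde L, \tilde A, \tilde D) \in [-a,a]^{p_E} \times [L_-, L_+] \times \mathcal{A}(\lambda_-, +\infty) \times SO(p_E)$, define $\tilde\Sigma = \tilde L \tilde D \tilde A \tilde D'$, 

$$t^-(x) = (1+\kappa\delta_\Sigma)^{-p_E}\,\Phi_{\tilde\mu,(1+\delta_\Sigma)^{-1}\tilde\Sigma}(x) \quad\text{and}\quad t^+(x) = (1+\kappa\delta_\Sigma)^{p_E}\,\Phi_{\tilde\mu,(1+\delta_\Sigma)\tilde\Sigma}(x),$$

then $[t^-, t^+]$ is a $\delta/9$ Hellinger bracket. 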

Furthermore, let {n,L,A,D) e [-a, a]^® x [L_,L+] x A{X-,\+) x SO{pe) and define S 
LDAD'. If 

'U-fi\?<PEl.L_\_^5l 
{l + ^)-^L<L<L 



yl<^<PE, \A-l-A-l\< 1 1 



14 A+"S 



Dx-Dx\\<^^^S^\\x\\ 



then t-{x) < < t+{x). 

By definition of $d^{\max}$, this implies that our choice of $\delta_\mu$, $\delta_L$, $\delta_D$, $\delta_A$ and $\delta_\Sigma$ is such that every 
$K$-tuple of pairs of the collection is a $\delta/9$-bracket and that they cover the whole set. 
The cardinality of this $\delta/9$-bracket covering is bounded by 



V 




PB . 



< 



X CS 




A_ 5 



126;8„ X+ PE / 

851n(^) 



PE-1\ 



84 



126/3^ A+ PE 



18al3^PE 




39/3Kln(§±)pE 



pe(pe-i) \ 

























25 



255/5K^ln 




PE-1\ 
So 



if[.j_d„ax (5/9, ,L. .D. ,A.]f ) 



< c^.Pb In 1 



18 (3 ^apE 



L_A_ 



39 



In 



+ CA, (Pi. - 1) (in (2 + ^ In p^) + In 

which concludes the proof. 

E.4 Entropy of spatial mixtures (Lemmas) 

Proof of Lemma 7. This is a variation on the proof of Genovese and Wasserman [24]. 



□ 



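Let $\{[\pi_i^-, \pi_i^+]\}_{1 \le i \le N_{S_{K-1}}}$ be a minimal covering of $\delta/3$ Hellinger brackets of the 
simplex $S_{K-1}$. Let $\{[t^-_{E,K,j}, t^+_{E,K,j}]\}_{1 \le j \le N_{E,K}}$ be a minimal covering of $\delta/9$ sup norm Hellinger 
brackets of $\mathcal{G}_{E,K}$ and $\{[t^-_{E^\perp,l}, t^+_{E^\perp,l}]\}_{1 \le l \le N_{E^\perp}}$ be a minimal covering of $\delta/9$ Hellinger brackets 
of $\mathcal{G}_{E^\perp}$. By definition, $\ln N_{S_{K-1}} = H_{[\cdot],d}(\delta/3, S_{K-1})$, 
$\ln N_{E,K} = H_{[\cdot],d^{\max}}(\delta/9, \mathcal{G}_{E,K})$ and $\ln N_{E^\perp} = H_{[\cdot],d}(\delta/9, \mathcal{G}_{E^\perp})$. 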
By construction. 



K 



K 



TZieV \k=l 



TZiev \/c=i / 
1 < i[ni] < Ns^_„l <j< Ne,k, 1 < Z < A^bx 



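is a covering of model $S_{K,\mathcal{P},\mathcal{G}}$ of cardinality $\exp\left(\|\mathcal{P}\|\, H_{[\cdot],d}(\delta/3, S_{K-1}) + H_{[\cdot],d^{\max}}(\delta/9, \mathcal{G}_{E,K}) + H_{[\cdot],d}(\delta/9, \mathcal{G}_{E^\perp})\right)$. 

It remains thus only to prove that each bracket is of sup norm Hellinger width smaller 
than $\delta$. 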

Using 

Lemma 11. For any $\delta$ Hellinger bracket $[t^-(x), t^+(x)]$, if for any $x$, $[u^-(x,y), u^+(x,y)]$ is a $\delta$ 
Hellinger bracket and $\delta \le \sqrt{2}/3$, then $[t^-(x)u^-(x,y), t^+(x)u^+(x,y)]$ is a $3\delta$ Hellinger bracket. 

we obtain immediately 

{tE,H,E (•) *B^^ (•) .4..,. (■) (•)) < 9(V9)^ = (V3)^. 



Let 



++ 



denote the corresponding (5/3 Hellinger bracket. 



By definition, 

\A;=1 / TZ,ev \fc=l / y 
K K 

^+ ++ 

3,1 



< sup £ E <fc 



\k=l k=l 



Seeing TTi^kgk,j,i{y) as a function of k and y, we can use 

Lemma 12. For any bracket $[t^-(x), t^+(x)]$, if for any $x$, $[u^-(x,y), u^+(x,y)]$ is a bracket, 
then 

$$d^2_y\left(\int t^-(x)u^-(x,y)\,d\lambda_x(x), \int t^+(x)u^+(x,y)\,d\lambda_x(x)\right) \le d^2_{x,y}\left(t^-(x)u^-(x,y), t^+(x)u^+(x,y)\right)$$
to obtain 



Vkig-p \fe=i / ni&v \k=i J / 

< sup (vrr^ i;^T; (y) , TT+fc t++ (y)) 



and then using again Lemma 11 
$$\le 9(\delta/3)^2 = \delta^2.$$

□ 

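Proof of Lemma 11. 

$$\begin{aligned}
d^2_{x,y}\left(t^-(x)u^-(x,y), t^+(x)u^+(x,y)\right)
&= \iint \left(\sqrt{t^+(x)u^+(x,y)} - \sqrt{t^-(x)u^-(x,y)}\right)^2 d\lambda_x(x)\,d\lambda_y(y) \\
&= \iint \left(\sqrt{t^+(x)}\left(\sqrt{u^+(x,y)} - \sqrt{u^-(x,y)}\right) + \left(\sqrt{t^+(x)} - \sqrt{t^-(x)}\right)\sqrt{u^-(x,y)}\right)^2 d\lambda_x(x)\,d\lambda_y(y) \\
&= \iint \Big( t^+(x)\left(\sqrt{u^+(x,y)} - \sqrt{u^-(x,y)}\right)^2 + \left(\sqrt{t^+(x)} - \sqrt{t^-(x)}\right)^2 u^-(x,y) \\
&\qquad + 2\sqrt{t^+(x)}\left(\sqrt{t^+(x)} - \sqrt{t^-(x)}\right)\sqrt{u^-(x,y)}\left(\sqrt{u^+(x,y)} - \sqrt{u^-(x,y)}\right) \Big)\, d\lambda_x(x)\,d\lambda_y(y) \\
&\le \int t^+(x)\, d^2_y\left(u^-(x,y), u^+(x,y)\right) d\lambda_x(x) + d^2_x\left(t^-(x), t^+(x)\right) \sup_x \int u^-(x,y)\,d\lambda_y(y) \\
&\qquad + 2\int \sqrt{t^+(x)}\left(\sqrt{t^+(x)} - \sqrt{t^-(x)}\right) \int \sqrt{u^-(x,y)}\left(\sqrt{u^+(x,y)} - \sqrt{u^-(x,y)}\right) d\lambda_y(y)\,d\lambda_x(x) \\
&\le \left( \left(\int t^+(x)\,d\lambda_x(x)\right)^{1/2} \sup_x d_y\left(u^-(x,y), u^+(x,y)\right) + d_x\left(t^-(x), t^+(x)\right) \sup_x \left(\int u^-(x,y)\,d\lambda_y(y)\right)^{1/2} \right)^2 .
\end{aligned}$$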



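Using 

Lemma 13. For any $\delta$-Hellinger bracket $[t^-, t^+]$, $\int t^-\,d\lambda \le 1$ and $\int t^+\,d\lambda \le \left(\delta + \sqrt{1+\delta^2}\right)^2$. 

we deduce using $\delta \le \sqrt{2}/3$ 

$$d^2_{x,y}\left(t^-(x)u^-(x,y), t^+(x)u^+(x,y)\right) \le \left(\delta + \sqrt{1+\delta^2} + 1\right)^2 \delta^2 \le \left(\sqrt{2}/3 + \sqrt{1+2/9} + 1\right)^2 \delta^2 \le 9\delta^2.$$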
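The last step of the proof of Lemma 11 relies on the numerical fact that the factor $(\delta + \sqrt{1+\delta^2} + 1)^2$ stays below $9$ when $\delta \le \sqrt{2}/3$. A minimal sketch of that check follows; the function name `lemma11_factor` is hypothetical and only serves this illustration.

```python
import math

def lemma11_factor(delta):
    """Factor multiplying delta**2 in the bound of Lemma 11."""
    return (delta + math.sqrt(1.0 + delta ** 2) + 1.0) ** 2

# The factor is increasing in delta, so its largest value on (0, sqrt(2)/3]
# is reached at the right end point.
worst_case = lemma11_factor(math.sqrt(2.0) / 3.0)
```

The worst case is roughly 6.64, which leaves some slack with respect to the constant 9 used in the statement.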



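Proof of Lemma 12. 

$$\begin{aligned}
d^2_y\left(\int t^-(x)u^-(x,y)\,d\lambda_x(x), \int t^+(x)u^+(x,y)\,d\lambda_x(x)\right)
&= \iint t^+(x)u^+(x,y)\,d\lambda_x(x)\,d\lambda_y(y) + \iint t^-(x)u^-(x,y)\,d\lambda_x(x)\,d\lambda_y(y) \\
&\qquad - 2\int \sqrt{\int t^+(x)u^+(x,y)\,d\lambda_x(x)}\,\sqrt{\int t^-(x)u^-(x,y)\,d\lambda_x(x)}\;d\lambda_y(y) \\
&\le \int_y\!\int_x t^+(x)u^+(x,y)\,d\lambda_x(x)\,d\lambda_y(y) + \int_y\!\int_x t^-(x)u^-(x,y)\,d\lambda_x(x)\,d\lambda_y(y) \\
&\qquad - 2\int_y\!\int_x \sqrt{t^+(x)u^+(x,y)}\,\sqrt{t^-(x)u^-(x,y)}\;d\lambda_x(x)\,d\lambda_y(y) \\
&\le d^2_{x,y}\left(t^-(x)u^-(x,y), t^+(x)u^+(x,y)\right),
\end{aligned}$$

where the first inequality follows from the Cauchy-Schwarz inequality applied to the inner integral in $x$. □ 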



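Proof of Lemma 13. The first point is straightforward as $t^-$ is upper-bounded by a density. 
For the second point, 

$$\begin{aligned}
\int t^+\,d\lambda &= \int (t^+ - t^-)\,d\lambda + \int t^-\,d\lambda \le \int \left(\sqrt{t^+} - \sqrt{t^-}\right)\left(\sqrt{t^+} + \sqrt{t^-}\right) d\lambda + 1 \\
&\le 2\int \left(\sqrt{t^+} - \sqrt{t^-}\right)\sqrt{t^+}\,d\lambda + 1 \le 2\left(\int \left(\sqrt{t^+} - \sqrt{t^-}\right)^2 d\lambda\right)^{1/2}\left(\int t^+\,d\lambda\right)^{1/2} + 1,
\end{aligned}$$

so that 

$$\int t^+\,d\lambda \le 2\delta\left(\int t^+\,d\lambda\right)^{1/2} + 1.$$
Solving the corresponding inequality yields 

$$\int t^+\,d\lambda \le \left(\delta + \sqrt{1+\delta^2}\right)^2.$$

□ 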
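The final step of the proof of Lemma 13 solves the inequality $I \le 2\delta\sqrt{I} + 1$ for the mass $I = \int t^+\,d\lambda$: setting $s = \sqrt{I}$ gives $s^2 - 2\delta s - 1 \le 0$, whose largest root is $\delta + \sqrt{1+\delta^2}$. The sketch below checks this numerically; the helper names are hypothetical and chosen only for this illustration.

```python
import math

def lemma13_bound(delta):
    """Largest mass I compatible with I <= 2 * delta * sqrt(I) + 1."""
    return (delta + math.sqrt(1.0 + delta ** 2)) ** 2

def satisfies_constraint(mass, delta, tol=1e-12):
    """True when the mass satisfies the inequality obtained in the proof."""
    return mass <= 2.0 * delta * math.sqrt(mass) + 1.0 + tol
```

The bound is reached with equality, so any strictly larger mass violates the constraint.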
E.5 Entropy of Gaussian families (Lemma) 

Proof of Lemma 9. We first define gi as the set of integers such that 

VI < i < A_(l + SaY^ < Ai,i < A_(l + (5a)^-+^ 
By construction 5, e N and A_(l + ^a)^* < A+. Now as A. 



■PB,PE — TTPE-l A ' 



nff + '^a)§'+i nrf ^-a + say^^ """^ - m;' a_(i + 6ay^ 

There is thus an integer d between and — 2 such that 

{1 + Sa)-''-' ^, . (X^^a)-^ 



nff A-(i + say^ - mi' a_(i + say^ ■ 

Let gi = Qi + 1 if i < d and gt = gi otherwise, then 

VI < i < PB, A_ (1 + 5a)^*-' < Ai^i < A_ (1 + SaY'^^ 
which implies $\lambda_-(1+\delta_A)^{\tilde g_i} \le (1+\delta_A)\lambda_+$. Now 

1 ^ (1 + ^-^ 

ngr' A-(i + say^ ner' a-(i + say^ 

and thus 



which implies 



A- < ...... < (1 + Sa)X+. 



mil A-(1 + ^a)^' 
Thus the diagonal matrix A defined by 

VI <l<PE-l,\^^ X-{1+SaY' 

and A^E.VE = t-ipe^-^ - — belongs to Ga{^-, ^+,Pe,Sa)- Furthermore, we can write for any 
lli=i 

1 < « < - 1 

Ai,iil + (5a)-^ < Ai^i < Ai^l + Sa) 

which implies 

Arl(l + 5A}-'<A-l<A-l(l + SA} 

and thus 

I A"/ - 1 < max (1 + 5a - 1, 1 - (1 + 5a)-i) = A-.i max (^Sa, 
Along the same lines, 
thus 

and 

\KlpB - ApIpb\ < ^pLpb^A < ^=''5a. 



□ 



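Proof of Lemma 10. We first prove that $[t^-, t^+]$ is a $\delta/9$ Hellinger bracket. As 
$(1+\delta_\Sigma)\tilde\Sigma^{-1} - (1+\delta_\Sigma)^{-1}\tilde\Sigma^{-1} = \left((1+\delta_\Sigma) - (1+\delta_\Sigma)^{-1}\right)\tilde\Sigma^{-1}$ is a positive definite matrix, one can apply 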

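Lemma 14. Let $\Phi_{(\mu_1,\Sigma_1)}$ and $\Phi_{(\mu_2,\Sigma_2)}$ be two Gaussian densities with full rank covariance matrix 
in dimension $p_E$ such that $\Sigma_1^{-1} - \Sigma_2^{-1}$ is a positive definite matrix. For any $x \in \mathbb{R}^{p_E}$, 

$$\frac{\Phi_{(\mu_1,\Sigma_1)}(x)}{\Phi_{(\mu_2,\Sigma_2)}(x)} \le \sqrt{\frac{|\Sigma_2|}{|\Sigma_1|}} \exp\left(\frac{1}{2}(\mu_1 - \mu_2)'(\Sigma_2 - \Sigma_1)^{-1}(\mu_1 - \mu_2)\right).$$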



proved by Maugis and Michel [39]. This yields using eventually k > | 



t-{x) _ (1 + Kfe)-^^ ^^,(i+5s)-is(a;) ^ 1 / (1 + < (1 + -^s)"^ 
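The Gaussian ratio bound of Lemma 14 can be illustrated in dimension one, where the condition on the covariances reduces to $\sigma_1^2 < \sigma_2^2$. The following sketch is an illustration only, not the argument of [39]: it assumes $p_E = 1$, uses hypothetical function names, and evaluates the ratio on a finite grid.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of the one-dimensional Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ratio_upper_bound(mu1, var1, mu2, var2):
    """One-dimensional version of the bound of Lemma 14 (needs var1 < var2)."""
    if not var1 < var2:
        raise ValueError("the bound needs var1 < var2")
    return math.sqrt(var2 / var1) * math.exp(0.5 * (mu1 - mu2) ** 2 / (var2 - var1))

def max_ratio_on_grid(mu1, var1, mu2, var2, lo=-10.0, hi=10.0, steps=2000):
    """Largest value of pdf1(x) / pdf2(x) over a regular grid of [lo, hi]."""
    best = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        best = max(best, gaussian_pdf(x, mu1, var1) / gaussian_pdf(x, mu2, var2))
    return best
```

In one dimension the bound is attained at $x^\star = (\mu_1\sigma_2^2 - \mu_2\sigma_1^2)/(\sigma_2^2 - \sigma_1^2)$, so the grid maximum should sit just below it.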



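Concerning the Hellinger width, 

$$\begin{aligned}
d^2(t^-,t^+) &= \int t^-(x)\,dx + \int t^+(x)\,dx - 2\int \sqrt{t^-(x)}\sqrt{t^+(x)}\,dx \\
&= (1+\kappa\delta_\Sigma)^{-p_E} + (1+\kappa\delta_\Sigma)^{p_E} \\
&\qquad - 2(1+\kappa\delta_\Sigma)^{-p_E/2}(1+\kappa\delta_\Sigma)^{p_E/2}\int \sqrt{\Phi_{\tilde\mu,(1+\delta_\Sigma)^{-1}\tilde\Sigma}(x)}\sqrt{\Phi_{\tilde\mu,(1+\delta_\Sigma)\tilde\Sigma}(x)}\,dx \\
&= (1+\kappa\delta_\Sigma)^{-p_E} + (1+\kappa\delta_\Sigma)^{p_E} - \left(2 - d^2\left(\Phi_{\tilde\mu,(1+\delta_\Sigma)^{-1}\tilde\Sigma}, \Phi_{\tilde\mu,(1+\delta_\Sigma)\tilde\Sigma}\right)\right).
\end{aligned}$$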

Using 

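Lemma 15. Let $\Phi_{(\mu_1,\Sigma_1)}$ and $\Phi_{(\mu_2,\Sigma_2)}$ be two Gaussian densities with full rank covariance matrix 
in dimension $p_E$, 

$$d^2\left(\Phi_{(\mu_1,\Sigma_1)}, \Phi_{(\mu_2,\Sigma_2)}\right) = 2\left(1 - 2^{p_E/2}\,|\Sigma_1\Sigma_2|^{-1/4}\,\left|\Sigma_1^{-1} + \Sigma_2^{-1}\right|^{-1/2} \exp\left(-\frac{1}{4}(\mu_1 - \mu_2)'(\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)\right)\right)$$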
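also proved in [39], we derive 

$$\begin{aligned}
d^2(t^-,t^+) &= \int t^-(x)\,dx + \int t^+(x)\,dx - 2\int \sqrt{t^-(x)}\sqrt{t^+(x)}\,dx \\
&= (1+\kappa\delta_\Sigma)^{-p_E} + (1+\kappa\delta_\Sigma)^{p_E} - 2\cdot 2^{p_E/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-p_E/2} \\
&= 2 - 2\cdot 2^{p_E/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-p_E/2} + (1+\kappa\delta_\Sigma)^{-p_E} + (1+\kappa\delta_\Sigma)^{p_E} - 2.
\end{aligned}$$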
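The closed form of Lemma 15 can be compared with a direct numerical integration of $\int (\sqrt{f} - \sqrt{g})^2$ in dimension one. The sketch below is an illustration under the assumption $p_E = 1$; the function names and the midpoint quadrature are choices made here, not part of the original argument.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of the one-dimensional Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def hellinger_sq_closed_form(mu1, var1, mu2, var2):
    """Lemma 15 with p_E = 1: squared Hellinger distance of two Gaussians."""
    coeff = math.sqrt(2.0) * (var1 * var2) ** (-0.25) * (1.0 / var1 + 1.0 / var2) ** (-0.5)
    return 2.0 * (1.0 - coeff * math.exp(-0.25 * (mu1 - mu2) ** 2 / (var1 + var2)))

def hellinger_sq_numeric(mu1, var1, mu2, var2, lo=-30.0, hi=30.0, n=60_000):
    """Midpoint rule for the integral of (sqrt(f) - sqrt(g))**2 on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        diff = math.sqrt(gaussian_pdf(x, mu1, var1)) - math.sqrt(gaussian_pdf(x, mu2, var2))
        total += diff * diff
    return total * step
```

With equal means and variances $(1+\delta_\Sigma)^{-1}$ and $(1+\delta_\Sigma)$, the closed form reduces to $2 - 2\sqrt{2}\,((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1})^{-1/2}$, which is the $p_E = 1$ case of the term appearing in the derivation above.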

Combining 



Lemma 16. For any < (5 < and any ps > ^, let k > ^ and = \J cosh(|) + 5, i/ 

1 1 1 

< < -• 

Pe& 6 

and 

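Lemma 17. For any $d \in \mathbb{N}$, for any $\delta_\Sigma > 0$, 

$$2 - 2\cdot 2^{d/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-d/2} \le \frac{d^2\delta_\Sigma^2}{2}.$$
Furthermore, if $d\delta_\Sigma \le c$, then 

$$(1+\kappa\delta_\Sigma)^{d} + (1+\kappa\delta_\Sigma)^{-d} - 2 \le \kappa^2\cosh(\kappa c)\,d^2\delta_\Sigma^2.$$

with $c = \frac{1}{6}$ yields $d^2(t^-,t^+) \le (\delta/9)^2$. 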

We now focus on the proof of $t^-(x) \le \Phi_{\mu,\Sigma}(x) \le t^+(x)$. As 

Lemma 18. Under Assumptions of Lemma 10, (1 + 6-£,)T,~^ — T,^ and S"^ ~ (1 + <5s)S~^ fl^'e 
positive definite and satisfies 

Vx e Mf^,x' ((1 + <5e)E-^ - S-i) X > iL-i-^(5E||x||2 

4 A-|- 

Vx e MP«,x' (S-i - (1 + X > ^.^ ^ ^ . L-i-^(5s||x||^ 

we can apply Lemma 14 on Gaussian density ratio to both 



to prove that they are smaller than 1. 
For the first one, using 



< (1 + k5^)-p- I J ^^^1^ exp Q(m - A)' ((1 + 5^)t - S)-^ (m - fii 
Now 

((1 + <5e)S - E)"' = ((1 + 5^)t (S-i - (1 + <5e)-^E-i) E)"' 
= (1 + (S-1 - (1 + 

and thus 

(m - A)' ((1 + - E)-' - ^) < (1 + s^r'Lz'xz'^^^^^Lx+s^'i-'xz'Wn - Hf 



Now as by construction, 



4 



one obtains 



^ + , 1 



^1^ TT^^^ exp(^-7«fe 



It is thus sufficient to prove that 



1 + K(5e 
or equivalently 



- exp -7k()e < 1 



3 

Now let 



fi{S^) = In I 1 + _ I ^ ^^(^ ^ ^^^^ _ 1 ^^(^ + _ i ln(l + ife) 
and thus provided k > f , as (5e < ^ 



4' — 6 

3 



(l+f)(l+i)(l+5^) 
Finally, as /i(0) = 0, one deduces 

which implies thus 



or ^ij„s{x) < t+{x). 

The second case is handled in the same way. 



< (1 + -p (2 - - + ^-)"^)" - '^)) ) 



Now as 

(S - (1 + = (S ((1 + - S-') (1 + 

= (1 + ((1 + S^)t-' - S-i)"' S-i 

and thus 

(m - A)' - (1 + (/X - A) < (1 + (5s)L"'Ali4ZA+,5s'il'Ali||/i - ^if 

< (1 + 5E)il'Ali4^5^Vi=;7«i^-A-^(5| 

A_ A-|_ 

<4pB7«(l + (5s)& 

one deduces 

< 7:;- — tSt- exp -4j5i57«(l + ^Eif^E 



All we need to prove is thus 



exp (27,(1 + < 1 

1 + koe 



64 



or equivalently 



Let 



= In (^^=1 = + /^<5e) - I ln(l + fc) 



/2(^e) 



K \ _ f i5s + K — I 



1 + 1 + (1 + «;(5e)(1 + (5e) 

and thus provided k> |, as (5e < g 



(l+t)(l+g) 
Finally, as /2(0) = 0, one deduces 



f^^^^^ > i/ ^ ^ 27«(1 + > 27«(1 + 5e)^e 



which implies 



or equivalently t^{x) < <f>p,s(-T). 

Proof of Lemma 16. A straightforward computation yields 

1 S 1 \/2 111 

& < -t; < , < < -. 

QPkPe Pe 9^(1)^ + 1 6 

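Proof of Lemma 17. 

$$\begin{aligned}
2 - 2\cdot 2^{d/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-d/2}
&= 2\left(1 - \left(\frac{e^{\ln(1+\delta_\Sigma)} + e^{-\ln(1+\delta_\Sigma)}}{2}\right)^{-d/2}\right) \\
&= 2\left(1 - \left(\cosh(\ln(1+\delta_\Sigma))\right)^{-d/2}\right) \\
&= 2 f(\ln(1+\delta_\Sigma))
\end{aligned}$$

where $f(x) = 1 - \cosh(x)^{-d/2}$. Studying this function yields 

$$f'(x) = \frac{d}{2}\sinh(x)\cosh(x)^{-d/2-1}$$
$$f''(x) = \frac{d}{2}\cosh(x)^{-d/2} - \frac{d}{2}\left(\frac{d}{2}+1\right)\sinh(x)^2\cosh(x)^{-d/2-2}$$

and, as $\cosh(x) \ge 1$, 

$$f''(x) \le \frac{d}{2}.$$

Now as $f(0) = 0$ and $f'(0) = 0$, this implies for any $x \ge 0$ 

$$f(x) \le \frac{d}{2}\frac{x^2}{2} \le \frac{d^2}{2}\frac{x^2}{2}.$$
We deduce thus that 

$$2 - 2\cdot 2^{d/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-d/2} \le \frac{d^2}{2}\left(\ln(1+\delta_\Sigma)\right)^2$$
and using $\ln(1+\delta_\Sigma) \le \delta_\Sigma$ 

$$2 - 2\cdot 2^{d/2}\left((1+\delta_\Sigma) + (1+\delta_\Sigma)^{-1}\right)^{-d/2} \le \frac{d^2\delta_\Sigma^2}{2}.$$

Now, 

$$(1+\kappa\delta_\Sigma)^{d} + (1+\kappa\delta_\Sigma)^{-d} - 2 = 2\left(\cosh(d\ln(1+\kappa\delta_\Sigma)) - 1\right) = 2g\left(d\ln(1+\kappa\delta_\Sigma)\right)$$
with $g(x) = \cosh(x) - 1$. Studying this function yields 

$$g'(x) = \sinh(x) \quad\text{and}\quad g''(x) = \cosh(x)$$

and thus, as $g(0) = 0$ and $g'(0) = 0$, for any $0 \le x \le c$ 

$$g(x) \le \cosh(c)\frac{x^2}{2}.$$

As $\ln(1+\kappa\delta_\Sigma) \le \kappa\delta_\Sigma$, $d\delta_\Sigma \le c$ implies $d\ln(1+\kappa\delta_\Sigma) \le \kappa c$, and we thus obtain 

$$(1+\kappa\delta_\Sigma)^{d} + (1+\kappa\delta_\Sigma)^{-d} - 2 \le \cosh(\kappa c)\,d^2\left(\ln(1+\kappa\delta_\Sigma)\right)^2 \le \kappa^2\cosh(\kappa c)\,d^2\delta_\Sigma^2.$$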
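Both inequalities of Lemma 17 are elementary and can be checked numerically on a grid of parameters. The sketch below is an illustration only: it assumes the reading of the bounds as $d^2\delta_\Sigma^2/2$ and $\kappa^2\cosh(\kappa c)\,d^2\delta_\Sigma^2$, and the function names are hypothetical.

```python
import math

def lemma17_first_lhs(d, delta):
    """2 - 2 * 2**(d/2) * ((1 + delta) + 1 / (1 + delta))**(-d/2)."""
    return 2.0 - 2.0 * 2.0 ** (d / 2.0) * ((1.0 + delta) + 1.0 / (1.0 + delta)) ** (-d / 2.0)

def lemma17_first_rhs(d, delta):
    """Upper bound d**2 * delta**2 / 2 (assumed reading of the statement)."""
    return d ** 2 * delta ** 2 / 2.0

def lemma17_second_lhs(d, delta, kappa):
    """(1 + kappa * delta)**d + (1 + kappa * delta)**(-d) - 2."""
    base = 1.0 + kappa * delta
    return base ** d + base ** (-d) - 2.0

def lemma17_second_rhs(d, delta, kappa, c):
    """Upper bound kappa**2 * cosh(kappa * c) * d**2 * delta**2, valid when d * delta <= c."""
    return kappa ** 2 * math.cosh(kappa * c) * d ** 2 * delta ** 2
```

The derivation $f''(x) \le d/2$ actually gives the sharper bound $d\,\delta_\Sigma^2/2$ for the first inequality, which implies the stated one since $d \le d^2$ for $d \ge 1$.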

Proof of Lemma 18. We deduce this result from a slightly more general one: 



□ 



Lemma 19. Let > 0. 

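Let $(L, A, D) \in [L_-, L_+] \times \mathcal{A}(\lambda_-, \lambda_+) \times SO(p_E)$ and $(\tilde L, \tilde A, \tilde D) \in [L_-, L_+] \times \mathcal{A}(\lambda_-, +\infty) \times 
SO(p_E)$, define $\Sigma = LDAD'$ and $\tilde\Sigma = \tilde L\tilde D\tilde A\tilde D'$. 
If 

$$\begin{cases}
(1+\delta_L)^{-1}\tilde L \le L \le \tilde L \\
\forall 1 \le i \le p_E, \quad \left|A_{i,i}^{-1} - \tilde A_{i,i}^{-1}\right| \le \delta_A\lambda_-^{-1} \\
\forall x \in \mathbb{R}^{p_E}, \quad \|Dx - \tilde Dx\| \le \delta_D\|x\|
\end{cases}$$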
then (1 + - and T.-^ - (1 + S^y^t.-^ satisfi 



es 



Vx e W^,x' ((1 + (5e)S-i - E-i) X > ((<5s - Sj^)X+^ - (1 + fc)Al^ (2fo + ^a)) || 
yx e W^,x' (S-i - (1 + a; > (^eA^^ - Al^ (2^d + -^a)) \\xf 

Indeed Lemma 16 ensures that (5e < g. Hence, if we let (5l = |(5e and Sa = Su = jgj^^ 
bounds of the previous Lemma become Vx G M.^'^ , 

x' ((1 + - S-i) X > ((fe - <5l)A;1 - (1 + (5s)Ali (2(5d + <5a)) 

> ((fe - ^fe) a;^ - (1 + S^)XZ'S^^S^^ Wxf 

while VarelKP*^, 

x- (E-> - (1 + &)-'£-■)! > (fcA+' - AZ' {2Su + Ua)) llif 



Proof of Lemma 19. By definition, 



v' ((1 + - S-i) a; = (1 + ^iri|I>^a:p - L-'J2^i\^i 

i=l i=l 

Pb Pe 

= (1 + ^ir/lD^a;^ - (1 + fe)^-^ ^ ^"^'1^^ 

i=l i=l 

Pe Pe 

+ (1 + S^)L-'J2^^'\^i^\' - (1 + E 
j=l 1=1 

Pe Pe 

+ (1 + fe)L-i ^ A7l\Dix\' - L-i ^ ^7/ \Dl 



x\' 



j=i 1=1 

Similarly, 

Pe Pe 



(S-i - (1 + X = L-^ ^ l^i^l' - (1 + fe)-'^-' E A"- l^i^ 

i=l i=l 

Pe Pe 

= Y: \D'A' - (1 + E l^i^ 

i=l i=l 

Pe Pe 

+ (1 + <5e)-^l-^ E - (1 + E 



.X? 



i=l 
PE 



PE 



+ (1 + Yl - (1 + fe)"'^"' E 



Now 

Pe 



i=l 



PE 
i=l 



i=l 



1/2 



PE 



1/2 



u=l 



< Al^fc||a;||2||a;|| = Al^25D||a;||^ 



Furthermore, 



j=i 



i=l 



PE 

<5AAli^|7?:x|2 = (5AAli||xf. 



We then notice that 

PE PE PE 



while 



Pe 



i=l 



Pe 



i=l 



Pe 



i=l 



>(l-(l + fe)-i)L-V'-"' 



> 



l|l„l|2 



1 + 



We deduce thus that 

x' ((1 + - S-i) X > (fe - (5L)L-iA;i||a;||2 - (1 + ds)L-'XZ' (25d + 2<5a) ||xf 



> - (5l)A;1 - (1 + fe)A:i (25d + <5a)) ||x|| 



and 



x' (S-i - (1 + ^e)-'S-1) X > - (1 + 6j:)-'L-'XZ^ (2<5d + <5a) 

L 